ABSTRACT

OPD  is  a  set  of  four  co-ordinated  synthesis  and  analysis  tools  for  the 
design  of  optimized  VLSI  datapath  and  CPU  pipelines.  Together,  these 
tools  cover  a  wide  range  of  design  tasks,  from  functional  partitioning  of  the 
system  into  pipeline  stages  through  datapath  definition  and  clocking,  to  the 
handling  of  technology-specific  constraints. 

OPD  has  tools  for  stage  partitioning,  clocking  scheme  calculation, 
datapath  sequencing,  and  pipeline  initiation  scheduling.  We  describe  these 
tools as well as the optimization algorithms they use. We discuss both probabilistic
and heuristic optimization techniques.

We  show  how  it  is  possible  to  rapidly  design  high-quality  pipelines  by 
using  OPD  with  existing  CAD  tools  such  as  logic  synthesizers.  We  show 
large  as  well  as  small  examples  taken  from  VLSI  chips  and  discrete  logic 
machines. 
Section I - Project Overview

We  examine  why  it  is  desirable  to  have  a  set  of  tools  for  Optimal  Pipeline  Design,  and  show 
how  the  sub-tools  of  OPD  work  together  and  with  existing  tools  such  as  logic  synthesizers. 
We  also  compare  OPD  to  similar  tools. 

1.  Motivation 

A  pipelined  design  is  often  the  best  way  to  achieve  high  performance  at  a  reasonable  cost. 
Pipelining  a  system  dramatically  enhances  its  performance  for  a  relatively  modest  increase 
in cost. This is why most digital machines today are pipelined. However, as Fig.1 shows,
many complex and inter-dependent steps are required to design a good pipe. OPD provides
a tool set to assist the designer in each of the steps in Fig.1, with the exception of step 2.
The objective of OPD is to help designers produce better pipes in less time.

We  use  two  primary  quantities  to  measure  the  quality  of  a  pipeline  design.  The  first 
is  throughput,  or  average  flow  rate;  this  indicates  the  processing  speed  of  the  system.  The 
second  is  the  complexity  of  the  datapath  and  of  the  logic  required  to  control  the  datapath. 
It  is  important  to  know  this,  since  complex  logic  implies  longer  delays  and  increased  chip 
area.  The  synthesis  tools  of  OPD  aim  to  maximize  throughput  and  minimize  complexity, 
while  the  analysis  tools  rapidly  calculate  these  two  quantities. 

Fig.1 shows how the system specifications evolve through various levels as the design
progresses.  Initially,  we  know  the  behavioral  specification  [Blackburn  85]  of  the  system, 
which  defines  the  system’s  function.  The  first  step  is  to  split  this  function  into  a  set  of 
overlapping  subfunctions  that  will  define  the  pipeline  stages.  The  second  step  is  to  choose  a 
detailed  datapath  that  will  implement  the  system’s  functions.  This  leads  to  a  register- 
transfer  level  (RTL)  [Snow  78]  description  of  the  system.  This  step,  and  only  this  step,  is 
entirely  left  to  the  designer.  OPD  is  not  a  datapath  synthesizer;  however,  it  can  quickly 
analyze  the  speed  of  a  proposed  datapath.  Of  course,  this  second  step  depends  very  much 
on the target technology. The third step is to determine a clocking scheme to optimally
sequence the above set of stages, using the datapath emerging from step 2. This step
involves choosing the number of clock phases and their lengths. The fourth step is to determine
the clock cycle and phase on which each gate in the datapath should run. The fifth
step is to choose an initiation sequence for the above datapath. This tells us exactly when
new data should be allowed to enter the pipeline to maximize processing throughput.

FIG 1: PIPE DESIGN STEPS

These  steps  are  complex  and  time-consuming  if  performed  manually.  Furthermore, 
they  are  highly  inter-related,  and  thus  require  a  great  amount  of  discipline  to  maintain 
overall  consistency  of  the  design  decisions.  The  first  step,  stage  partitioning,  requires  the 
exploration of a large number of alternatives. The third step, clocking scheme determination,
depends on the behavioral stage partitioning as well as the detailed datapath implementation.
It is hard to keep track of both of these manually. The fourth and fifth steps,
which  deal  with  detailed  datapaths,  are  especially  complex  in  the  case  of  MOS-VLSI  [Mead 
80]  [Weste  85]  systems,  since  there  are  many  factors  to  be  considered  during  these  steps. 
One  such  factor  is  the  use  of  complex  multiphase  clocking  schemes  ([Weste  85]  chap.  5.4). 
These  offer  great  flexibility,  but  complicate  the  designer’s  task  by  introducing  additional 
degrees  of  freedom.  Similarly,  the  use  of  transparent  latches  as  pipeline  staging  elements 
([Weste  85]  ch  5.4  and  [Kogge  81]  ch  2)  makes  it  possible  to  effectively  trade  time  across 
pipe  stages  -  a  fast  stage  can  make  up  for  a  slow  one.  However,  the  designer  must  keep 
track  of  signal  delays  across  these  latches.  Other  factors  to  be  kept  in  mind  are  possible 
constraints  that  acceptable  phase  assignments  must  satisfy.  Such  constraints  can  result 
from  external  signals,  or  can  arise  because  part  of  the  phase  assignment  has  already  been 
done,  because  gates  are  shared  across  pipe  stages,  because  of  precharging  schemes,  or 
because  of  layout  area  constraints  that  limit  the  number  of  control  lines. 

OPD  has  the  ability  to  take  into  account  all  the  detailed  constraints  found  at  the 
register-transfer  level  when  assigning  clock  phases  to  gates  and  latches  in  the  datapath. 
OPD  does  not  stop  at  the  behavioral  level,  where  the  system  description  is  usually  rough 
and  approximate;  it  goes  from  the  behavioral  level  right  down  to  the  final  datapath  blocks. 
Moreover, OPD contains features that make it suitable both for VLSI datapath and CPU
pipeline  design  tasks. 

2.  OPD  Description  and  Usage  Scenario 

OPD consists of a set of four tools that cover the design levels from system behavior to
datapath  and  logic  synthesis.  These  tools  work  primarily  in  a  top-down  fashion,  where 
behavioral  design  tasks  are  performed  before  layout  related  optimization.  However,  OPD 
runs  fast,  which  makes  it  possible  to  iterate  each  design  step  many  times.  Moreover,  each 
tool within OPD accepts a rich set of constraints that can be used to feed OPD with information
gained in previous design cycles. Fig.2 shows how these tools can be used together
and with existing tools, such as logic synthesis routines. We now describe each tool briefly.
We will describe each tool in detail later; our objective here is to show how the tools work
together.

•  SP:  this  is  the  Stage  Partitioning  tool.  Fig. 3  shows  an  example  of  what  SP  does.  The 
input  and  output  files  for  this  example  are  in  Appendix  1. 

The  input  to  SP  consists  of  three  items.  The  first  is  a  set  of  dataflow  graphs  [Snow  78] 
[Park  85]  [Parker  85]  that  describe  each  one  of  the  functions  the  system  is  to  execute  (in 
the  case  of  a  CPU,  each  graph  might  correspond  to  one  instruction).  The  nodes  of  these 
graphs  correspond  to  simple,  atomic  operations  (such  as  an  ALU  operation),  and  each 
node  has  an  attached  delay.  Arcs  show  where  each  operation  gets  its  arguments  from, 
and  where  it  sends  its  results  to;  each  arc  has  an  attached  bit-width.  A  probability  of 
occurrence  is  attached  to  each  dataflow  graph  that  shows  how  frequently  that  particular 
function  will  be  executed.  The  second  input  item  is  a  description  of  shared  resources  and 
resource  constraints.  The  third  is  the  target  pipeline  cycle  time.  Given  these,  SP  will 
show  where  the  staging  latches  should  be  placed  in  order  to  achieve  the  target  stage 
length  and,  as  a  secondary  objective,  to  minimize  the  number  of  bits  of  staging  latches. 

•  PC:  this  is  the  Phase  Calculation  tool.  Fig.4  shows  an  example  of  what  PC  does. 

FIG 3: AN SP ILLUSTRATION
FIG 4: A PC ILLUSTRATION

PC takes three input items. The first is the partitioned set of graphs from SP along with
the longest stage time (cycle time). The second is the target phase length. The third is a
description  of  the  datapath  that  will  implement  these  dataflow  graphs  (provided  by  the 
designer)  along  with  all  the  datapath-level  constraints  to  suit  the  target  technology;  PC 
will  then  split  up  the  cycle  into  a  set  of  phases  whose  length  will  be  as  close  as  possible 
to  the  target  phase  length,  and  such  that  all  the  datapath  constraints  can  be  met  and  the 
longest  stage  time  is  minimized.  PC  therefore  breaks  up  the  clock  cycle  chosen  by  SP 
into  a  set  of  phases  that  are  well  suited  to  the  final  datapath.  The  datapath  is  described 
as  a  DAG;  the  nodes  correspond  to  hardware  operators,  and  the  arcs  to  signal  nets. 
There is also a set of constraints associated with the datapath. The purpose of these constraints
is to capture technological limitations on acceptable phase sets. Such limitations
might  arise  from  particular  layout  or  clocking  schemes.  The  datapath  specification  is 
described  fully  in  Appendix  2. 

•  PA: this is the Phase Assignment tool. Fig.5 shows an example of what PA does. PA
determines  on  which  phase/cycle  each  one  of  the  datapath  gates  and  latches  should  be 
clocked  in  order  to  satisfy  the  target  technology  constraints  and  to  maximize  pipe 
throughput. 

The  input  to  PA  consists  of  four  items.  The  first  is  a  description  of  datapath  blocks  used 
to  implement  the  system.  The  second  is  a  set  of  constraints  that  the  resulting  Phase 
Assignment  has  to  follow  in  order  to  be  suitable  for  the  target  technology.  The  third  and 
fourth  items  are  the  partitioning  from  SP  and  phase  lengths  from  PC.  PA  will  then 
determine  on  which  phase/cycle  each  one  of  the  datapath  blocks  should  be  clocked.  PA 
uses  a  more  precise  optimization  algorithm  than  PC.  PA  produces  a  reservation  table 
that  shows  exactly  how  long  each  stage  will  be  in  the  final  design. 

FIG 5: PA ILLUSTRATION

•  SCHED: this is the reservation table scheduler. Fig.6 shows an example of what SCHED does.
The input to SCHED is the reservation table produced by PA. SCHED will then determine
when new data should be allowed to enter the pipeline in order to maximize the
overall processing throughput. The output from SCHED is an optimal initiation
sequence [Kogge 81] for the pipe.
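
The kind of analysis a reservation table supports can be sketched as follows. This is a minimal illustration, not SCHED's actual algorithm: the function names and the example table are hypothetical, and the sketch only finds the best constant initiation interval, whereas an optimal initiation sequence may use varying intervals. A latency is forbidden when two marks in the same row of the table are that many cycles apart, since two data items launched with that spacing would need the same stage in the same cycle.

```python
def forbidden_latencies(table):
    """Return the set of initiation latencies that would cause a collision.

    `table` is a reservation table: one row per stage, one 0/1 entry per cycle.
    """
    forbidden = set()
    for row in table:
        used = [t for t, mark in enumerate(row) if mark]
        for i, first in enumerate(used):
            for later in used[i + 1:]:
                forbidden.add(later - first)
    return forbidden


def smallest_constant_latency(table):
    """Smallest fixed interval between initiations that never collides."""
    forbidden = forbidden_latencies(table)
    latency = 1
    # A constant latency L is safe only if no multiple of L is forbidden.
    while any(f % latency == 0 for f in forbidden):
        latency += 1
    return latency


# Hypothetical two-stage table, four cycles wide.
table = [
    [1, 0, 0, 1],  # stage 1 is busy in cycles 0 and 3
    [0, 1, 1, 0],  # stage 2 is busy in cycles 1 and 2
]
print(sorted(forbidden_latencies(table)))  # [1, 3]
print(smallest_constant_latency(table))    # 2
```

In this example, starting a new data item every 2 cycles keeps both stages collision-free, while intervals of 1 or 3 cycles would not.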

The  output  from  SCHED  fully  defines  when  each  control  signal  will  have  to  be  activated; 
we  can  feed  this  information  to  logic  synthesis  routines  such  as  MIS  [Brayton  86], 
ESPRESSO [Brayton 83] or ESPRESSO-ML [Brayton 84] or [Leive 81] to generate the pipeline
controller automatically. We will show some examples of how this can be done semi-automatically
for two pipeline control strategies. However, we have not built a tool to generate
the pipeline controller in a fully automatic fashion. This is because there are many
different pipeline control styles.

During  the  initial  design  stages,  we  can  profitably  use  estimator  functions  [Kurdahi 
85] to get approximate delay figures for each node in the behavioral graphs. These estimators
use factors such as bit-width and type of operation (integer versus floating point, for
instance)  to  provide  delay  estimates  that  can  be  used  for  comparative  purposes. 

We decided not to include automatic datapath synthesis in OPD. Automatic
module selection for datapath design is a very complex task; most datapath synthesis routines
[Park 85] [Park 85b] [Parker 85] [Thomas 83] work mainly with the approximate
information available in the system's behavioral description. These routines therefore have
difficulty taking into account all the constraints that result from target technology details.
As a result, most such programs produce technology-independent datapaths that can be
very far from optimal when they are mapped into a particular target technology. This does
not fit with OPD's "toolbox" approach; we take the "vertical" route: OPD has tools that
remain useful down to the layout level.

In  summary,  OPD  is  a  sequence  of  tools  that  perform  progressively  more  detailed 
optimization  tasks  on  pipelines.  The  higher-level  tool  SP  has  a  very  simplistic  view  of  the 
system;  the  later  tools  (PA,  SCHED)  take  into  account  more  detailed  information.  Of 
course,  these  various  optimization  tasks  are  inter-dependent.  For  instance,  we  want  to  do 
SP such that when the datapath is finally scheduled using SCHED, the throughput will be
maximal.

To handle these inter-dependencies, each tool has a simplified view of what the next
tool will do; it uses this simplified view to perform optimization at its own level. For example,
SP hopes that, by minimizing the longest stage length, PC will be able to find phases
that  lead  to  a  faster  pipe.  In  fact,  the  cost  function  for  SP  could  be  a  simplified  version  of 
PC.  The  cost  function  for  PC  is  actually  a  simplified  version  of  PA.  Of  course,  SCHED 
knows  the  exact,  final  pipe  speed. 

3.  OPD  Context  and  Related  Tools 

We  show  how  OPD  relates  to  and  differs  from  similar  tools.  We  are  mainly  concerned  with 
datapath  synthesis  systems  and  clocking  scheme  synthesizers. 

Datapath  synthesizers  take  a  behavioral  description  of  a  system,  perhaps  augmented  by 
performance  and  cost  constraints,  and  produce  a  register-transfer  level  structure  that 
implements  this  behavior.  [Blackburn  85]  provides  an  overview  of  one  such  set  of  tools,  the 
CMU Design Automation System. A datapath synthesizer is also described in [Parker 85].
Fig.7, from [Snow 78], shows the strategy used by the CMU/DA. OPD and the datapath
synthesizers  are  complementary  tools.  OPD  can  use  a  synthesizer  to  map  the  partitioned 
stage  graph  into  a  register-transfer  level  datapath.  From  then  on,  OPD  could  perform 
phase  length  calculation  and  phase  assignment  using  the  synthesized  datapath.  A  possible 
pitfall  of  this  approach  is  that  the  synthesis  program  needs  to  know  quite  a  lot  about  the 
target  technology  and  the  pipeline  structure  to  produce  a  good  datapath.  The  pipelining 
and  datapath  synthesis  tasks  are  inter-dependent. 

Clocking scheme synthesizers, such as [Park 85] and [Park 85b], decide how to pipeline
the datapath and calculate the optimal number of phases per clock cycle, as well as the
length of each phase. However, these tools work at a fairly high level of abstraction, where
the details of the final datapath are still unknown. As a result, the phases produced by
these tools tend to match the structure of the pipeline stages. This is not what we want in
the case of a MOS-VLSI pipe. We want the phases to match the interconnection of the
gates and the precharging and bussing schemes. Fig.8 and Fig.9 show a phase calculation
found in [Park 85]. Each one of the two graphs in Fig.8 represents one instruction that the
system, an HP21-MX CPU, is able to execute. Each phase corresponds to one stage latch;
phases serve to clock pipe stages. Fig.10 shows what OPD means by a phase; a phase is
used to clock gates within a stage, as is commonly done in MOS designs. OPD needs
detailed datapath information, in the form of constraints on the possible phase assignments,
in order to calculate these MOS-phases.

FIG 7: CMU / DA OVERVIEW (TAKEN FROM [SNOW 78]); top box: ISPS SYSTEM SPECIFICATION

4.  Software  Organization 

The  software  is  organized  as  a  set  of  independent  programs  which  communicate  via  shared 
files.  The  table  below  gives  some  statistics.  The  more  time-critical  routines  are  in  the 
language  C  [Kernighan  78],  while  Franz  LISP  [Franz  86]  is  used  everywhere  else. 


Function                                   Language     Approx. #lines (excluding test code)

Stage Partition and Phase Calculation      Franz LISP   2000
Preprocessing for Phase Assignment         Franz LISP   1500
Phase Assignment, Heuristic & Annealing    C            3000
Reserv. table Scheduler                    C            700

The next chapter describes the Stage Partitioning and Phase Calculation tools.


FIG 8: HP21-MX CPU DATA FLOW GRAPH
FIG 9: HP21-MX PHASES ACCORDING TO [PARK 85]

Section II - Stage Partitioning and Phase Calculation

In  this  chapter,  we  define  the  Optimal  Stage  Partitioning  problem  (SP),  as  well  as  the 
Optimal  Phase  Calculation  (PC)  problem.  We  use  the  Microcode  Compaction  Problem 
[Fisher  81]  to  show  that  SP  is  NP-complete.  We  describe  the  heuristic  algorithms  used  to 
perform  the  SP  task,  and  finally  show  how  the  same  algorithms  can  be  used  to  perform  PC. 
We  also  present  other  possible  approaches  for  solving  SP. 

1.  Problem  Specification  for  SP 

1.1.  Input 

The  input  to  SP  consists  of  a  set  of  dataflow  graphs  ("traces”)  with  an  attached  probability 
of  occurrence,  plus  a  set  of  resource  constraints  and  a  target  stage  length.  Each  graph 
represents  one  of  the  operations  or  instructions  that  the  system  is  to  execute.  The  attached 
probability  represents  how  often  that  particular  operation  is  expected  to  occur.  For 
instance, in a CPU, there might be one graph for load/store instructions, another for arithmetic,
and another for branches. The attached probabilities are the relative dynamic
instruction  frequencies.  Specifically,  the  input  consists  of  the  following  elements. 

Ni (i = 1,..,NN)
    A set of NN dataflow nodes. A node corresponds to an operation; a
    node can optionally have an attached resource-type RTj used to
    specify resource constraints. For instance, in a CPU, the dataflow
    node corresponding to arithmetic instructions would have ALU as an
    associated resource type. This models the fact that arithmetic instructions
    require the ALU (or one of the ALUs if there are many).

RTj (j = 1,..,NRT)
    A set of resource-types; each resource-type has an associated number
    called the "limit". A resource-type corresponds to a class of
    hardware units; the "limit" is the number of available units of that
    type. For instance, we may have a resource-type "ADDER" with a
    "limit" of 2. This means that not more than 2 "ADD" dataflow
    nodes can be simultaneously scheduled.

Tk (k = 1,..,NT)
    A list of "traces". Each trace is a directed acyclic dataflow graph on
    a subset of the above set of nodes. Each arc of each trace has an associated
    bit-width. Each trace corresponds to one of the operations
    that the system will have to execute.

PTk (k = 1,..,NT)
    A probability figure attached to each trace. PTk is the probability,
    or relative frequency, with which the system will be expected to execute
    the operation corresponding to Tk.

TARG
    The target stage length.


Traditional  Data  Flow  Graph  input  descriptions  need  to  have  a  way  to  describe  that  certain 
operations  are  mutually  exclusive  in  the  system.  Fig.10.bis  shows  the  problem;  if  we  did 
not  specify  that  nodes  A  and  E  are  never  simultaneously  active,  the  stage  partitioner  would 
(falsely)  deduce  that  the  critical  path  in  the  system  is  A-C-E.  This  path  can  never  occur. 
Our system description avoids this problem altogether since we only specify those dataflow
graphs that correspond to actual, possible instructions. For the example of Fig.10.bis, we
would describe the system by giving two dataflow graphs, each one corresponding to a real
operation. The first graph would be A-C-D, the second B-C-E. OPD will trace these graphs
independently, and avoid the false path A-C-E altogether. We therefore do not need a special
dataflow node to describe mutual exclusivity in OPD.
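
The false-path effect can be illustrated with a small sketch. The node delays below are hypothetical (the text gives none), and `longest_path` is an illustrative helper, not part of OPD. The sketch compares the critical path of a single merged graph with the worst critical path over the two traces A-C-D and B-C-E taken separately.

```python
def longest_path(delay, arcs, nodes):
    """Longest total delay over any path that stays within `nodes`."""
    succ = {n: [] for n in nodes}
    for src, dst in arcs:
        if src in succ and dst in succ:
            succ[src].append(dst)
    memo = {}

    def longest_from(n):
        if n not in memo:
            memo[n] = delay[n] + max((longest_from(m) for m in succ[n]), default=0)
        return memo[n]

    return max(longest_from(n) for n in nodes)


# Hypothetical delays: A and E are slow, B and D are fast.
delay = {"A": 50, "B": 10, "C": 20, "D": 10, "E": 50}
trace1 = [("A", "C"), ("C", "D")]  # first real operation
trace2 = [("B", "C"), ("C", "E")]  # second real operation

merged = longest_path(delay, trace1 + trace2, set(delay))
per_trace = max(
    longest_path(delay, trace, {n for arc in trace for n in arc})
    for trace in (trace1, trace2)
)
print(merged)     # 120, via the false path A-C-E
print(per_trace)  # 80, the true worst case
```

With these assumed delays, the merged graph overestimates the critical path by 50 percent, which is the error that per-trace descriptions avoid.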


1.2.  Optimization  Objective 

The  objective  is  to  partition  the  given  set  of  nodes  into  consecutive  pipeline  stages  such 
that  the  length  of  each  stage  is  minimal  and  less  than  TARG  and,  secondarily,  so  as  to 
minimize the sum of the bit-widths of the arcs that are cut by stage latches. This
minimizes the number of stage latch bits. In fact, SP first fixes the number of stages, then
moves the stage latches around (described below).

FIG 10.BIS: MUTUAL EXCLUSION

The  cost  function  we  use  is  equal  to  the  length  of  the  longest  stage  plus  a  small  factor 
times  the  total  number  of  stage  latch  bits.  The  length  of  each  stage  is  determined  by 
scheduling  the  nodes  that  are  in  that  stage  so  as  to  respect  the  precedence  constraints 
specified by the "traces" and so as to respect the resource limit constraints specified by the
RTj.
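
As a rough illustration of this cost function, the sketch below evaluates it on a simple chain of nodes. It rests on several assumptions that are not in the text: there are no resource limits (so a stage's length is just the sum of its node delays), the weight `latch_weight` of 0.01 is an arbitrary stand-in for the "small factor", and all names and numbers are hypothetical.

```python
def partition_cost(delays, widths, cuts, latch_weight=0.01):
    """Cost of a chain partition: longest stage + small factor * latch bits.

    delays[i] is the delay of node i; widths[i] is the bit-width of the arc
    from node i to node i+1; cuts lists (in increasing order) the arcs that
    carry a stage latch.
    """
    stage_lengths = []
    start = 0
    for c in list(cuts) + [len(delays) - 1]:
        stage_lengths.append(sum(delays[start:c + 1]))
        start = c + 1
    latch_bits = sum(widths[c] for c in cuts)
    return max(stage_lengths) + latch_weight * latch_bits


delays = [50, 10, 50]   # hypothetical node delays
widths = [32, 8]        # hypothetical arc bit-widths

# Both cuts give a longest stage of 60; the latch term breaks the tie.
print(partition_cost(delays, widths, [0]))  # cuts the 32-bit arc
print(partition_cost(delays, widths, [1]))  # cuts the 8-bit arc, lower cost
```

Because the latch term is weighted lightly, it only decides between partitions whose longest stages are nearly equal, which matches its role as a secondary objective.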

We also take into account resource sharing between nodes in different stages. For
instance, assume two stages S1 and S2 both have ADD operations; S1 has N1 ADDs, while
S2 has N2 ADDs. Furthermore, assume there are only NADDER "ADDERs" available in
the system. We would then set the stage lengths for both S1 and S2 to be greater than or
equal to TMAX = (...) / NADDER. This is because TMAX is the time it will take to get data
through both these stages; the possibility to overlap S1 and S2 will be reduced due to this
resource sharing. Our calculation does assume, however, that some overlap between S1
and S2 will occur so as to keep the shared resources busy all the time. In other words, we
take the best-case impact of resource sharing on the stage length term: we assume that the
shared resources will always be kept busy.

The  above  cost  function  makes  the  slowest  trace  determine  the  cycle  time.  For 
instance, in a CPU, this means that the slowest instruction determines the machine's cycle
time. Another possible choice is to calculate the longest stage time for each trace. We
could then use the weighted average of these per-trace longest times as the cost. Fig.11
shows the difference between these two cost functions. In order to benefit from the possibility
of having different cycle times for each trace, we need a controller which can dynamically
choose the proper cycle time according to the operation the system is executing.
Examples of such machines are the PDP11/34 and the PDP11/40. These machines had
micro-engines that could choose between two or three possible micro-cycle times ([SBN 82]
ch. 34) in a dynamic fashion, according to the particular micro-instruction being executed.
However, OPD uses the simpler worst-case stage length cost.

FIG 11: WORST-CASE VERSUS PER-TRACE CYCLE TIMES

The  output  from  SP  is  therefore  a  list  of  stage  latches  that  completely  partition  each 
one  of  the  input  traces  into  disjoint  stages.  Each  latch  therefore  consists  of  a  cut-set  of  arcs 
for  each  input  trace.  Appendix  1  describes  the  input  and  output  format  used  by  SP,  and 
shows  two  examples:  a  small  test  example,  and  the  HP21-MX  CPU  graph  taken  from  [Park 
85]. 

1.3.  Skewed  Stages 

SP  also  has  the  capability  of  inserting  stage-latches  that  do  not  form  cut-sets  for  all  the 
input  graphs.  This  feature  is  optional,  since  inserting  such  latches  will  lead  to  a  "skewed” 
pipeline. In such a pipeline, the total number of clock cycles required for a data item to
travel  through  all  the  stages  depends  on  the  data  item  -  it  may  require  one  or  more  extra 
cycles  if  it  hits  the  non-cutset  latches.  Fig.  12  shows  a  "skewed”  pipeline  example  inspired 
from  a  very  common  micro-engine  design.  Micro-engines  (and  RISC  CPUs)  are  often 
skewed  pipelines;  that  is,  some  data  takes  longer  than  others  to  filter  through  the  engine. 
For  instance,  a  branch  based  on  condition  codes  will  take  two  micro-cycles  to  get  executed; 
a  normal  ALU  operation  will  take  only  one.  Any  system  that  supports  a  delayed  branch  is, 
in  effect,  a  skewed  pipeline. 

Skewed  pipes  are  useful  when  we  want  to  have  a  small  number  of  stages  (to  make  it 
easier  to  handle  data  dependencies  for  example)  and  at  the  same  time  keep  each  stage  very 
short. The penalty we pay is that "rare" operations (like branches) will execute with an
extra delay. In the case of Fig.12, we do not want to have more than two stages (one pipe
latch) because of data dependencies. The un-skewed pipe has a cycle time of 130ns; the
skewed version will execute instructions in an "average" of 110ns (the usual case executes
in 100ns; branches require 200ns but only happen 10% of the time). Of course, if conditional
branches were more common, the skewed pipe would be the wrong choice. It is
better to have a single-cycle conditional branch if there are many of them, even if this
means increasing the cycle time slightly.
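
The arithmetic behind this trade-off can be checked with a short sketch. The 100ns, 200ns, 130ns and 10% figures come from the example above; the function names are ours, and the break-even calculation is an added observation rather than something OPD computes.

```python
from fractions import Fraction


def average_time(p_branch, normal_ns=100, branch_ns=200):
    """Average instruction time of the skewed pipe for a given branch fraction."""
    return (1 - p_branch) * normal_ns + p_branch * branch_ns


def break_even_branch_fraction(unskewed_ns=130, normal_ns=100, branch_ns=200):
    """Branch fraction at which the skewed pipe is no faster than the un-skewed one."""
    return Fraction(unskewed_ns - normal_ns, branch_ns - normal_ns)


print(average_time(Fraction(1, 10)))   # 110
print(break_even_branch_fraction())    # 3/10
```

Under these figures, the skewed pipe keeps its advantage over the 130ns un-skewed pipe only while conditional branches make up less than 30% of the instructions.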


FIG 12: A SKEWED PIPELINE

Another possibility is to sequence this skewed pipeline with a very short cycle time.
Each instruction can then use the minimum number of cycles required for its execution.
For the example of Fig.12, we could have a 33ns cycle time. Non-branch instructions execute
in 3 cycles, while branches require four. In effect, by using a short and well chosen
cycle  time,  we  have  finer  control  over  the  timing  of  data  flow  through  the  pipeline.  In 
Fig.12,  a  cycle  of  33ns  was  chosen  because  the  node  delays  are  100ns  and  30ns.  A  value  of 
33ns  represents  some  form  of  greatest  common  divisor. 
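
A small sketch shows how the cycle counts follow from the cycle time. It makes two assumptions that the text does not state explicitly: the quoted 33ns is read as one third of 100ns, and a branch traverses both nodes (100ns + 30ns = 130ns) while other instructions traverse only the 100ns node. The 10% branch frequency is taken from the earlier example, and the helper name is hypothetical.

```python
import math
from fractions import Fraction


def cycles_needed(path_delay_ns, cycle_ns):
    """Whole number of cycles needed to cover a path delay."""
    return math.ceil(Fraction(path_delay_ns) / cycle_ns)


cycle = Fraction(100, 3)             # about 33.3ns (assumed reading of "33ns")
normal = cycles_needed(100, cycle)   # non-branch instructions
branch = cycles_needed(130, cycle)   # branches (assumed 100ns + 30ns path)

average = (Fraction(9, 10) * normal + Fraction(1, 10) * branch) * cycle
print(normal, branch)    # 3 4
print(float(average))    # roughly 103.3ns per instruction
```

Note that with a cycle of exactly 33ns, three cycles (99ns) would fall just short of the 100ns node, so the choice of cycle time has to be made with care.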

OPD  does  not  calculate  such  short  "GCD”  cycle  lengths  automatically.  However,  once 
the  designer  has  chosen  a  suitable  cycle  time,  this  can  be  fed  to  OPD.  OPD  can  then  break 
the  cycle  up  into  phases  (if  required),  determine  how  many  cycles  each  instruction  will 
require,  and  produce  the  reservation  table  for  this  pipeline.  Finally,  OPD  can  calculate  an 
optimal  initiation  sequence  for  this  table. 

The  last  task  -  calculating  an  optimal  initiation  sequence  -  can  become  quite  complex 
with  short  cycle  times.  This  is  because  the  reservation  table  will  spread  over  a  great  many 
cycles,  and  will  have  a  rather  irregular  pattern.  Short  cycle  times  will  therefore  usually 
require  more  involved  pipeline  control  logic.  OPD  helps  the  designer  specify  and  synthesize 
this  logic. 

1.4.  Related  Algorithms  and  Problems 

We  show  that  the  Microcode  compaction  problem  (MC)  [Fisher  81]  is  a  subtask  of  SP.  The 
objective of MC is to pack a sequence of elementary micro-operations into horizontal micro-instructions
so as to minimize the total execution time of the sequence by exploiting the
parallelism  available  in  the  micro-engine. 

In  order  to  calculate  the  length  of  a  particular  stage  under  resource  limitations,  SP 
has  to  pack  the  operations  (nodes)  in  that  stage  so  as  to  minimize  the  total  stage  time  while 
satisfying  the  resource  constraints.  In  order  to  calculate  the  length  of  a  particular  pipe 
stage, SP therefore has to perform MC for the operations within that stage. MC is therefore
a subtask of SP. [Fisher 81] shows that MC is NP-complete; he does this by showing
that  a  constrained  multiprocessor  scheduling  problem,  known  to  be  NP-complete  [Coffman 
76],  is  reducible  to  MC.  It  then  follows  that  MC,  and  therefore  SP,  are  NP-complete.  We 
now  describe  the  algorithms  used  for  SP. 

2.  Optimization  Algorithms  for  SP 

SP is done in three steps; the first step, opt0, creates a starting-point stage partitioning;
the next two steps, opt1 and opt2, iteratively optimize this partitioning to reduce the maximal
stage length and the number of stage latch bits.

Opt0 relies on the delays of the nodes in each stage to estimate the stage length.
Opt1 and opt2 use a more precise cost function that actually schedules the nodes in each
stage to calculate the stage's length.

2.1.  Cost  Function 

The  cost  function  is  made  up  of  three  terms:  the  length  of  the  longest  pipe  stage,  an  extra 
term (called the share-term) to account for resource sharing between stages, and a small
factor  times  the  number  of  stage  latch  bits. 

In order to find the length of each stage in the presence of shared resources, it is
necessary to schedule the operations of that stage while satisfying the resource limits. We
use forward-urgency scheduling [Park 85] for this. Whenever two independent operations
compete for a shared resource, the operation that is farther from the end of the stage gets
the resource first. The distance of a node from the end of the stage is the length, measured
in delay, of the longest path from that node to the output stage latch. Fig. 13 shows the
forward-urgency-sched procedure along with an example.
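The scheduling rule described above can be illustrated with a minimal Python sketch. It is not the forward-urgency-sched procedure of Fig. 13 itself; the data layout (`ops` as a dictionary with `delay`, `res` and `preds` fields, `limits` as the number of units per resource type) and all names are assumptions made for this illustration, and delays are assumed to be strictly positive.

```python
from collections import defaultdict

def forward_urgency_sched(ops, limits):
    """ops: name -> {"delay": d, "res": r, "preds": [...]} (hypothetical layout).
    limits: resource type -> number of available units.
    Returns (start times, stage length)."""
    succs = defaultdict(list)
    for name, op in ops.items():
        for p in op["preds"]:
            succs[p].append(name)

    # Urgency = longest delay path from the op (inclusive) to the output latch.
    urgency = {}
    def dist_to_latch(n):
        if n not in urgency:
            urgency[n] = ops[n]["delay"] + max(
                (dist_to_latch(s) for s in succs[n]), default=0)
        return urgency[n]
    for n in ops:
        dist_to_latch(n)

    free_at = {r: [0] * k for r, k in limits.items()}  # when each unit is free
    start, finish = {}, {}
    pending, t = set(ops), 0
    while pending:
        # Ops whose inputs are all available at time t.
        ready = [n for n in pending
                 if all(p in finish and finish[p] <= t for p in ops[n]["preds"])]
        # The op farthest from the end of the stage gets the resource first.
        for n in sorted(ready, key=lambda m: (-urgency[m], m)):
            units = free_at[ops[n]["res"]]
            idle = [i for i, f in enumerate(units) if f <= t]
            if idle:
                start[n], finish[n] = t, t + ops[n]["delay"]
                units[idle[0]] = finish[n]
                pending.remove(n)
        if pending:
            # Advance to the next time at which some op completes.
            t = min(f for f in finish.values() if f > t)
    return start, max(finish.values(), default=0)
```

With a single adder shared by two independent additions, the addition that feeds a long downstream operation is served first, which shortens the stage compared with the opposite order.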

FIG 13: FORWARD-URGENCY SCHEDULING ALGORITHM

2.2. Opt0 Seed Algorithm

Opt0 takes a target stage length as argument, and packs blocks into successive stages that
are each no longer than the target length. Opt0 iteratively proceeds from the graph's root
nodes (those with no fanin) towards the graph's tail nodes (those with no fanout), creating a
new stage whenever the target length is reached. Fig. 14 shows how opt0 works.
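A minimal Python sketch of this seed step follows. It works node by node rather than by tracing paths and cutting the graph, so it is only an approximation of the procedure in Fig. 14. The representation (`delay` and `preds` dictionaries) and the function name are assumptions, and every node delay is assumed to be at most the target length.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def opt0(delay, preds, targ):
    """delay: node -> delay; preds: node -> list of fanin nodes.
    Returns node -> stage index, with every stage no longer than targ."""
    stage, arrive = {}, {}
    # Visit nodes from the roots towards the tails.
    for n in TopologicalSorter(preds).static_order():
        fanin = preds.get(n, ())
        # A node cannot sit in an earlier stage than any of its inputs.
        s = max((stage[p] for p in fanin), default=0)
        # Latest arrival among inputs that live in that same stage.
        t_in = max((arrive[p] for p in fanin if stage[p] == s), default=0)
        if t_in + delay[n] > targ:
            # Target length reached: the node starts a new stage.
            s, t_in = s + 1, 0
        stage[n], arrive[n] = s, t_in + delay[n]
    return stage
```

Like opt0, this sketch estimates stage length from node delays only and ignores resource limits.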

2.3. Opt1 Optimization

Opt1 takes an existing stage partition and attempts to iteratively improve it; it does so by
trying to move single nodes from one stage to the next or to the previous stage. After trying
to move every boundary node, the best resulting partition is picked. Fig. 15 shows an
example of opt1.

2.4. Opt2 Optimization

Opt2 also takes an existing stage partition which it iteratively improves; it does so by an
algorithm based on Kernighan and Lin's bipartition exchange algorithm [Kernighan 70].
Opt2 successively moves every operation node $N_1$ through $N_k$ located at a stage
boundary to the next stage. It then picks the best point $k = k_0$ such that the stage length
reached by moving $N_1$ through $N_{k_0}$ is minimal with respect to $k$. Opt2 then repeats
this procedure, this time moving boundary nodes to the previous stage. Fig. 16 shows an
example of opt2. [Devadas 87] shows another application of the Kernighan and Lin algorithm
to synthesis problems.
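The forward half of an opt2 pass can be sketched in Python as below. This is only an illustration under stated assumptions: the partition is a dictionary from node to stage index, the ordered list of boundary nodes is supplied by the caller, and the toy cost helper simply sums node delays per stage (as if each stage were a chain) instead of scheduling the stage.

```python
def stage_lengths(stage_of, delay):
    """Toy stage length: sum of node delays per stage (chain assumption)."""
    lengths = {}
    for n, s in stage_of.items():
        lengths[s] = lengths.get(s, 0) + delay[n]
    return lengths

def opt2_forward_pass(stage_of, boundary_nodes, cost):
    """Move N_1..N_k to the next stage one after the other, keeping every
    move even if the cost rises, then return the best prefix k0 seen."""
    trial = dict(stage_of)
    best, best_cost = dict(stage_of), cost(stage_of)
    for n in boundary_nodes:
        trial[n] += 1            # tentative move to the next stage
        c = cost(trial)
        if c < best_cost:        # remember the best point k0 so far
            best, best_cost = dict(trial), c
    return best, best_cost
```

If no prefix improves on the starting partition, the original configuration is returned, as happens in the example of Fig. 16.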

The reason we have three algorithms is as follows. Opt0 finds a starting
configuration. Opt1 performs optimization with respect to local motion of nodes across
stage boundaries. As such, opt1 typically gets stuck in configurations that correspond to
local minima for the cost function. On the other hand, opt2 attempts more drastic changes
to the pipeline structure. Opt2 has the potential to get the search out of the kind of local
minima that block opt1.

Therefore, we would normally call opt0 first. We would then iteratively call opt1
until a local minimum is found. At this point, we call opt2 to escape from the local
minimum, and, hopefully, find a more promising configuration. We iterate opt1 on the new

FIG 14: OPT0 OPTIMIZATION ALGORITHM


PROCEDURE OPT0;

INPUT: TARG (TARGET STAGE LENGTH) + G (GRAPH TO PARTITION)

OPT0 PARTITIONS G INTO A SET OF CONSECUTIVE STAGES;
THE LENGTH OF EACH STAGE IS <= TARG.

OPT0 STARTS OUT AT THE ROOT NODES (THOSE WITH NO FANIN) AND WORKS
ITS WAY TO THE TAIL NODES (THOSE WITH NO FANOUT). ALONG THE WAY,
OPT0 CREATES NEW STAGES AS FOLLOWS.

OPT0 STARTS OUT WITH ONE (IMAGINARY) STAGE LATCH LOCATED JUST BEFORE
THE ROOT NODES.

GIVEN A STAGE LATCH, OPT0 CREATES A NEW STAGE AS FOLLOWS.

FIRST OPT0 WILL TRACE ALL THE PATHS FROM THE GIVEN LATCH TOWARDS
THE TAIL NODES. OPT0 WILL GO AS FAR AS POSSIBLE FROM THE GIVEN
LATCH, SUCH THAT THE DISTANCE ALONG EACH PATH TO THIS LATCH IS <= TARG.
SECOND OPT0 WILL CREATE A NEW STAGE LATCH WHICH CUTS THE GRAPH
AT THE EXTREMITIES OF THE TRACED PATHS.

THE NODES BETWEEN THE GIVEN AND THE NEW STAGE LATCHES FORM A NEW
MAXIMAL STAGE WHOSE LENGTH IS <= TARG.

OPT0 THEN REPEATS THIS PROCEDURE, STARTING FROM THE NEW STAGE LATCH.

OPT0 STOPS WHEN THE TAIL NODES HAVE BEEN REACHED.

END;

EXAMPLE:

NOTE: NUMBERS ARE DELAYS
ALL PARALLEL PATHS ARE ACTIVE

OPT0 WITH TARG = 15 GIVES:

FIG 15: OPT1 EXAMPLE

ORIGINAL PARTITION (O.P.): CYCLE TIME = 40

OPT1 WILL TRY MOVING A, B, C TO THE OTHER SIDE OF (L).
OPT1 STARTS FROM O.P. FOR EACH MOVE:

(I)    MOVE A => CYCLE TIME = 50
(II)   MOVE B => CYCLE TIME = 35
(III)  MOVE C => CYCLE TIME = 25

OPT1 WILL THEREFORE PICK (II). AFTER MOVING B,
OPT1 WILL ITERATE THESE MOVES ONCE MORE, BUT WILL FIND NO IMPROVEMENT
IN THE CYCLE TIME ((II) IS OPTIMAL). OPT1 WILL THEN STOP.


FIG 16: OPT2 EXAMPLE

ORIGINAL PARTITIONING: CYCLE TIME = 10

OPT2 WILL TRY THE FOLLOWING CONFIGURATIONS AND PICK THE BEST:

CYCLE TIME = 15        CYCLE TIME = 15

IN THIS EXAMPLE, THE ORIGINAL CONFIGURATION YIELDS THE BEST CYCLE TIME.




configuration until a new local minimum is found, call opt2 on it, and so on. We stop when
consecutive passes of opt2 followed by iterations of opt1 yield no improvement in the cost
function. Appendix 1 shows how opt1 and opt2 are used in sequence. We see that opt2
helps the search escape from opt1's local minima.
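The calling sequence described above (seed, repeated local improvement, escape step, stop when nothing improves) can be sketched as a small Python driver. The function names and the convention that `opt1` and `opt2` each take a partition and return a new one are assumptions for illustration; the actual tool interfaces are those shown in Appendix 1.

```python
def descend(partition, opt1, cost):
    """Iterate opt1 until it no longer lowers the cost (a local minimum)."""
    while True:
        candidate = opt1(partition)
        if cost(candidate) >= cost(partition):
            return partition
        partition = candidate

def optimize(seed, opt1, opt2, cost):
    """seed: the partition produced by opt0.
    Alternate opt2 (escape) with runs of opt1 (local descent), and stop
    when an opt2 pass followed by opt1 iterations yields no improvement."""
    best = descend(seed, opt1, cost)
    while True:
        candidate = descend(opt2(best), opt1, cost)
        if cost(candidate) >= cost(best):
            return best
        best = candidate
```

The driver only ever keeps a configuration that lowers the cost, so the loop terminates as soon as one full opt2 plus opt1 round fails to improve on the best partition found so far.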

2.5. Skewed Pipe Generation

Skewed pipes are generated in two steps. We first pick a (regular) stage from latch S1 to
latch S2 that is a good candidate for "skewing"; this is typically the longest stage. Next,
we "skew" this stage by inserting a latch S' about half-way along the longest path of this
stage. As Fig. 17 shows, we then have a skewed stage such that the lengths of the three
new component skewed stages (S1 to S'), (S' to S2) and (S1 to S2 with S') are all strictly
less than that of the original stage (S1 to S2). As we mentioned, this step is optional and
especially useful to prevent long but rare operations (like branches) from slowing down
frequent ones (like loads/stores). We now describe how the algorithms designed for SP can
be used to determine a set of phase lengths for sequencing the actual datapath.

FIG 17: SKEWED PIPE EXAMPLE

LATCH (S1)    SKEW-LATCH (S')    LATCH (S2)

CRITICAL PATH THROUGH STAGE

WITHOUT SKEW-LATCH (S'): CYCLE TIME = 40
WITH (S'): CYCLE TIME = 20

... SEE ALSO: FIG. 12

3. The Phase Calculation Problem

3.1. Problem Specification For PC

The objective of PC is to break up the cycle into a set of non-overlapping phases such that
these phases can be efficiently used to sequence the datapath gates and latches. The input
to PC consists of the following items.

$S_i$ ($i = 1,..,NST$)    The stage latches that correspond to the
                          best partition found by SP

MINIPHIS                  The minimum number of phases in a cycle

TARGPHI                   The target length for each phase

DPATH                     A description of the datapath used to implement the system
                          along with the datapath's associated constraints. The format
                          for DPATH is that required by the Phase Assignment step
                          PA. The task of finding a suitable datapath is left up to the
                          designer. OPD does not perform datapath synthesis.


PC then calculates a set of at least MINIPHIS phases such that the length of each phase is
close to TARGPHI, and such that when these phases are used to "sequence" the above
datapath, the length of the longest pipe stage is minimized. The Phase Assignment routine,
PA, decides how a set of phases is used to sequence the given datapath.


3.2.  Using  SP  To  Solve  PC 

We  notice  that  PC  and  SP  are  very  similar;  the  objective  of  SP  is  to  break  up  a  dataflow 
graph  into  stages;  that  of  PC  is  to  break  up  the  cycle  time,  which  corresponds  to  the  length 
of  one  pipe  stage,  into  phases.  The  cost  function  for  SP  is  the  length  of  the  longest  pipe 
stage;  that  of  PC  is  the  length  of  the  longest  stage  after  PA  has  calculated  how  to  best  use 
the  given  phase  lengths  to  sequence  the  actual  datapath. 

Because  of  this  similarity,  we  re-use  SP  to  solve  PC.  To  map  PC  to  SP,  we  first  fold 
our  dataflow  graph.  This  step  involves  breaking  the  dataflow  graph  into  the  separate  pipe 
stages  found  by  SP;  these  stages  are  then  placed  in  parallel  to  form  the  folded  graph.  The 
folded  graph  therefore  has  only  one  pipe  stage,  and  the  length  of  that  stage  is  equal  to  the 
cycle  time  found  by  SP. 
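The folding step can be illustrated with a short Python sketch. It assumes the dataflow graph is given as a dictionary `preds` from node to fanin nodes and that `stage_of` holds the stage index chosen by SP; both names, and the helper `longest_path`, are hypothetical and used only for this illustration. Folding then amounts to dropping every arc that crosses a stage latch, so that the stages sit side by side.

```python
def fold(preds, stage_of):
    """Keep only arcs whose two ends are in the same stage, so that the
    stages found by SP end up in parallel in the folded graph."""
    return {n: [p for p in fanin if stage_of[p] == stage_of[n]]
            for n, fanin in preds.items()}

def longest_path(preds, delay):
    """Length of the longest delay path in a DAG given as node -> fanin."""
    memo = {}
    def arrive(n):
        if n not in memo:
            memo[n] = delay[n] + max((arrive(p) for p in preds[n]), default=0)
        return memo[n]
    return max(arrive(n) for n in preds)
```

In this simplified delay-only model, the longest path of the folded graph is the length of the longest stage, which is the cycle time referred to above.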

We now run SP on the folded graph, with a target stage length (an input of SP) equal to the
target phase length we want. SP will then break up the folded graph into stages, where the
length of each stage is close to our target phase length. By calling opt0, opt1 and opt2 on
the folded graph, SP will generate a range of possible phase lengths. The cost function we
use to evaluate a phase length sequence is the datapath's pipe throughput, as calculated by
the Phase Assignment routine after it actually sequences the datapath. This procedure
requires that the datapath be re-sequenced by PA every time SP generates a new set of
phase lengths. Fig. 18 shows the procedure.

In  effect,  what  we  have  done  is  to  use  the  structure  of  the  dataflow  graph  to  generate 
a  good  set  of  phase  length  sequences  via  SP  on  the  folded  graph.  We  try  out  every 
sequence  of  phase  lengths  on  the  actual  datapath  via  PA  and  pick  the  phase  lengths  that 
lead  to  the  best  results. 

The software has routines to fold the dataflow graph and automatically run PA.
Appendix 3 shows PC at work on a CPU example derived from the RISC-II [Katevenis 83].
As we see from this example, it is not easy to find a good set of phase lengths solely from
the dataflow or even the datapath graphs. One reason for this is that only certain phase
length patterns may be suitable. For instance, the designer might want to have equal
length phases. This is indeed the case with the RISC-II [Katevenis 83] [Sherburne 84].
Appendix 3 shows that, within this constraint, the phase length chosen by the original
designers (120ns) is very close to the optimum. However, the sample runs in Appendix 3
show clearly that when phases are used to sequence detailed MOS gates, there is no simple
relationship between stage length and phase lengths. For example, slow gates may run
over many phases, while many fast static gates could fit into the same phase. The
relationship between dataflow stages and phases is therefore very tenuous.

Once the dataflow graph has been partitioned, and the phase lengths have been
calculated, we can use the Phase Assignment tool (PA) once again to determine on which
phase/cycle each gate of the datapath should be run. When PA was called from PC, we set
control parameters for PA so that PA would run fast and produce an approximate result as
a guide to the higher level PC task. This time, we can call PA to do a full-precision job.
The next chapter describes how this is done.


FIG 18: PHASE LENGTH CALCULATION

"FOLDED" GRAPH

STAGE#1 IN ORIGINAL GRAPH    STAGE#2 IN ORIGINAL GRAPH    STAGE#3 IN ORIGINAL GRAPH

BY RUNNING SP ON THE FOLDED GRAPH, THE CYCLE IS BROKEN UP INTO
A SET OF PHASES.

EACH SUCH PARTITION IS USED TO SEQUENCE THE ACTUAL DATAPATH VIA PA.
PA THEN TELLS US HOW FAST THE PIPE WILL RUN WITH THIS SET OF PHASES;
WE PICK THE BEST SET OF PHASES GENERATED BY SP.

Section III - Phase Assignment

PA is the next tool in the OPD set, and performs Phase Assignment on datapath gates. In
this chapter, we define the Optimal Phase Assignment (PA) problem. We use the Microcode
Compaction problem [Fisher 81] to show that PA is NP-complete. We then examine related
problems and algorithms, and discuss and evaluate two approaches for solving PA. The
first one is probabilistic and uses Simulated Annealing (SA). We discuss how we applied SA
for solving PA, and use an analogy with physics to discuss some of the tradeoffs involved.
We cover the annealing cost function, the move generation and the choice of the annealing
control parameters. The second approach is heuristic, and leads to a fast algorithm for
solving PA.

The  main  purpose  of  the  Simulated  Annealing  algorithm  is  to  produce  high-quality 
reference  solutions  for  a  few  examples.  We  use  these  solutions  to  demonstrate  the 
effectiveness  of  the  much  faster  heuristic  algorithm.  The  long  run-time  of  SA  is  therefore 
not  a  problem  in  our  case. 

We  also  show  that  the  constraint  set  chosen  for  PA  is  powerful  enough  to  capture 
most  usual  design  situations. 

1.  Problem  Specification 

1.1.  Input 

The input to the PA optimization problem consists of a register-transfer level description of
the circuit, plus a description of the clock and the pipeline stages. It consists of the
following elements.

$B_i$ ($i = 1,..,N$)      A set of N interconnected blocks (gates and latches). A
                          block is simply an operator; it can be as simple as a logic
                          gate, or as complex as a whole processor.

$TY_i$ ($i = 1,..,N$)     The type of each block (static logic gate, dynamic, domino,
                          NORA, latch)

$D_i$ ($i = 1,..,N$)      The propagation delay of each block expressed in the same
                          units used to specify the lengths of each clock phase.

DAG                       A Directed Acyclic Graph that expresses the connectivity
                          among blocks. We show later how common causes for cyclic
                          connectivity graphs can be worked around.

Phase lengths             The length of each clock phase. There are P phases. These
                          lengths are calculated by PC.

($ST1_k$, $ST2_k$) ($k = 1,..,S$)
                          For each one of the S pipeline stages, $ST1_k$ is the input
                          latch to the stage, while $ST2_k$ is the output latch. The
                          stages could, for instance, be those calculated by SP. Since
                          stages are consecutive, $ST1_{k+1}$ is the same latch as
                          $ST2_k$.

Constraints               There are 5 types of constraints. It is possible to force a block
                          to run on a particular phase, to make a set of blocks run on
                          the same phase, to force a set of blocks to run on disjoint
                          phases, or on different starting phases, and to make pipeline
                          stages run at non-overlapping times (so as to allow resource
                          sharing, for example).


Appendix 2 describes the input and the output format for PA. Appendix 2 also shows two
examples, a small test case and a datapath example derived from the "fraction datapath" of
the SPUR [Patterson 87] [Hill 85] Floating Point Unit [Adams 86].


1.2.  Optimization  Objective 

PA does not modify the datapath or the stage partitioning since those tasks have already
been performed earlier during the design by the designer and SP/PC. The objective is to
assign a clock phase to each block so as to maximize the throughput of the pipeline subject
to all given constraints. In fact, the program assigns a fire time $F_i$ ($i = 1,..,N$) to each
block $B_i$. The fire time is the absolute time when the block can start evaluating. For a
static block, this is the time when the inputs are ready. The precedence constraints imply
that:

For any blocks $B_i$ and $B_j$ such that there is an arc from $B_i$ to $B_j$ in the connectivity
DAG, we should have $F_i + D_i \le F_j$. This ensures that the inputs to $B_j$ will be stable
when it uses them.


The objective of maximizing the pipeline flow rate is replaced by the simpler aim of
minimizing the longest stage length. This objective was chosen because the time spent in the
slowest stage is often what limits the overall throughput. The simplified objective can be
expressed as:

Minimize the maximum, over all the pipe stages, of the difference between the fire time of
the output latch and the fire time of the input latch of each pipe stage, or:

$$\text{Minimize} \; \max_{k=1}^{S} \left[ F_{ST2_k} - F_{ST1_k} \right]$$
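As a minimal sketch, the precedence condition and the simplified objective stated above can be written as two small Python functions. The names `fire`, `delay`, `arcs` and `stages` are assumptions for illustration: `fire` and `delay` map block names to fire times and delays, `arcs` lists the (source, destination) pairs of the connectivity DAG, and `stages` lists the (input latch, output latch) pairs.

```python
def precedence_ok(fire, delay, arcs):
    """True when F_i + D_i <= F_j holds for every arc (i, j) of the DAG."""
    return all(fire[i] + delay[i] <= fire[j] for i, j in arcs)

def longest_stage(fire, stages):
    """Objective to minimize: the largest difference between the fire time
    of a stage's output latch and that of its input latch."""
    return max(fire[out] - fire[inp] for inp, out in stages)
```

For a two-stage pipe whose latches fire at 0, 12 and 20, `longest_stage` returns 12, so the first stage is the one that limits the throughput.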


1.3.  Related  Optimization  Problems  and  Algorithms 

The Microcode compaction (MC) problem [Fisher 81] can be reduced to PA.

In order to solve MC using PA, we map each micro-op to a block whose delay is equal
to the time taken to execute that micro-op. We use the connectivity DAG to specify the
precedence constraints among micro-ops. We can describe resource conflicts among
micro-ops by introducing a constraint that specifies the blocks that should not be
simultaneously used. For example, if the micro-engine has only one adder, we would create a
constraint to specify that two ADD blocks should not be used at the same time. Lastly, we
define a single pipeline stage that corresponds to the total execution time. This mapping of
MC to PA can be done in polynomial time.

An optimal solution to the corresponding PA problem will produce an optimal packing
of the micro-ops into horizontal micro-instructions. MC is therefore reducible to PA. As we
mentioned earlier, MC is NP-complete. It therefore follows that PA is NP-complete as well.
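The mapping from MC to PA can be sketched in a few lines of Python. The dictionary layout of the resulting PA instance, the pseudo-latches named "IN" and "OUT", and the argument names are all assumptions made for this illustration; the real PA input format is the one given in Appendix 2.

```python
from itertools import combinations

def mc_to_pa(exec_time, deps, unit_of):
    """exec_time: micro-op -> execution time; deps: (before, after) pairs;
    unit_of: micro-op -> hardware unit it occupies.
    Assumes no micro-op is itself named "IN" or "OUT"."""
    # Each micro-op becomes a block whose delay is its execution time.
    delay = dict(exec_time)
    delay["IN"], delay["OUT"] = 0, 0
    # Precedence among micro-ops becomes the connectivity DAG; the two
    # pseudo-latches bracket the single stage.
    arcs = list(deps)
    arcs += [("IN", op) for op in exec_time]
    arcs += [(op, "OUT") for op in exec_time]
    # Micro-ops sharing a unit must not be used at the same time.
    no_overlap = [(a, b) for a, b in combinations(sorted(exec_time), 2)
                  if unit_of[a] == unit_of[b]]
    return {"delay": delay, "arcs": arcs,
            "stages": [("IN", "OUT")], "no_overlap": no_overlap}
```

The number of arcs and constraints produced is at most quadratic in the number of micro-ops, in line with the polynomial-time claim above.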

However, it is difficult to use micro-code compaction algorithms directly for PA. Most
such algorithms, such as [Landskov 80], optimize in a fairly local way, performing code
motion mainly within basic blocks. Those that apply global reorganization [Fisher 81] rely
on the structure of typical micro-programs in order to achieve fast algorithms. This makes
them less applicable to PA.

PA is also analogous to a one-dimensional constrained multi-layer compaction problem.
To see this, we map each logic block $B_i$ to a line of length $D_i$, the delay of the block.
The absolute x-coordinate $X_i$ of the line in the layout corresponds to the fire time $F_i$ of
the block. We also create as many layers as there are pipeline stages, and map all the blocks
in one stage to the corresponding layer. The objective of minimizing the max stage length
then maps to one of minimizing the layout size, subject to the ordering relation specified by
the connectivity DAG. Our constraints map into layout constraints that are somewhat
bizarre; for instance, firing a block on the fixed phase $\Phi_1$ maps into placing the line such
that its x-coordinate is an integer multiple of some quantity (the cycle time). Moreover,
some of the constraints in PA are global, while layout compaction is a local process. This
is why it is difficult to apply standard layout compaction algorithms such as [Hsueh 81]
[Weste 81] to solve PA.

A typical circuit might have a hundred blocks, with a total cycle time of 500 units (e.g.
500ns). There might be about 50 or 100 constraints. With one fire time to choose per block,
this gives us a search space with one hundred dimensions. Of course, the constraints reduce
this number.

The large search space is the main reason why integer programming techniques, such
as those used by [Leiserson 83] for retiming, were not chosen for PA. Dynamic Programming
cannot be applied since the Principle of Optimality, as described in [Horowitz 84] p. 199,
is not satisfied by our problem.

2.  Simulated  Annealing  Based  Phase  Assignment 

2.1.  The  Generic  Algorithm 

Simulated Annealing (SA) [Kirkpatrick 83] is a general search technique suitable for
solving constrained multi-variate optimization problems. In the case where there are no
constraints, the objective is to minimize a given function $F(x_1,..,x_N)$, where each
variable $x_i$ belongs to a domain $D_i$, $i = 1,..,N$, of possible values.

SA works by establishing a physical system that is analogous to the problem. Such a
system is described by giving its state vector $\vec{x} = (x_1,...,x_N)$ and its energy function
$E(\vec{x})$. We can think of this system as being a crystal, for instance. It is possible to find
a state of minimal energy for a physical system by using the (physical) process of annealing.
First, the crystal is melted by heating it up; the energy $E$ is then high, the state $\vec{x}$ is
random. Then, the crystal is progressively cooled (annealed); the temperature $T$ is reduced
by small steps, waiting for thermal equilibrium at each temperature. Finally, when $T$ is low
enough, the crystal settles into a stable state of minimal energy.

If we consider our $\vec{x}$ to be the state vector of a physical system whose energy function
is $E = F(\vec{x})$, then we can minimize $F$ by simulating the physical annealing of the crystal.
Of course, we also need to introduce a parameter $T$ that is the analogue of physical
temperature.

This simulation is performed via the Metropolis algorithm [Metropolis 53], which was
used in the early days of scientific computing to simulate many-bodied physical systems in
equilibrium at a given temperature. This method determines the expected values of
variables of the system at a given temperature $T$ by generating a set of states
$S_1, S_2, \ldots$ that are representative of the system's behavior at $T$. The variables are then
averaged over this set of states. The states are generated one by one. A new state $S'$ is
generated from the previous state $S$ by randomly changing one of the state variables and
calculating the change in energy $\Delta E$ that would result; the new state is then entered into
the set of representative states with a probability equal to the Boltzmann factor
$e^{-\Delta E / T}$, or 1 if $\Delta E \le 0$. The first state to enter the set can be random. Fig. 19
outlines the SA algorithm and the Metropolis technique. To cool the system, $T$ is multiplied
by the cooling factor $\alpha$, usually chosen such that $0.8 \le \alpha < 1$.


FIG 19: GENERAL SA ALGORITHM

procedure SA;
    /* for each temperature */
    while outer-loop-criterion not satisfied, do
        /* do the Metropolis simulation */
        while inner-loop-criterion not satisfied, do
            generate move at random;
            evaluate ΔE for move;
            if ΔE < 0 accept move;
            else if exp(-ΔE/T) > R0_1 accept move;
                /* R0_1 denotes a random number between 0 and 1 */
            else reject move;
            if move was accepted, update configuration to new state;
        end while;
        /* end of Metropolis simulation */
        /* update temperature T to cool system */
        T = T * α;
    end while;
end SA;

SA is therefore characterized by three procedures:

•  The  Move  Generator,  which  decides  which  variable  should  be  randomly  changed, 
and  to  what  value,  in  order  to  generate  S’  from  S; 

•  The Inner Loop Criterion, which decides when enough representative states have
been generated at a given temperature to ensure that the Monte-Carlo Metropolis
simulation truly reflects the physical state of the system in thermal equilibrium at $T$;

•  The  Outer  Loop  Criterion,  which  decides  when  the  system  has  been  cooled  to  a 
sufficiently  low  temperature. 
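The generic algorithm and its three procedures can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the annealer used by PA: the inner-loop criterion is a fixed number of moves per temperature, the outer-loop criterion is a temperature floor, and the parameter names and default values are hypothetical. The sketch also remembers the best state seen, a convenience that Fig. 19 does not include.

```python
import math
import random

def simulated_annealing(state, energy, gen_move, t0=10.0, t_min=0.01,
                        alpha=0.9, moves_per_temp=100, rng=None):
    """gen_move(state, rng) is the move generator and returns a new state.
    Returns (best state seen, its energy)."""
    rng = rng or random.Random(0)
    e = energy(state)
    best, best_e = state, e
    t = t0
    while t > t_min:                         # outer-loop criterion
        for _ in range(moves_per_temp):      # inner-loop criterion
            candidate = gen_move(state, rng)
            delta = energy(candidate) - e
            # Metropolis rule: always accept downhill moves, accept uphill
            # moves with probability exp(-delta / t).
            if delta <= 0 or math.exp(-delta / t) > rng.random():
                state, e = candidate, e + delta
                if e < best_e:
                    best, best_e = state, e
        t *= alpha                           # cool the system
    return best, best_e
```

A caller supplies the problem-specific pieces, for instance a move generator that shifts one variable by a small random amount and an energy function that returns the cost of a configuration.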

In the physical process of annealing, it is necessary to cool the substance very slowly to
produce a proper crystal. If the cooling is too rapid (quenching), the substance may enter a
metastable state that corresponds to a local minimum of the energy. This local minimum
can be quite far from the global minimum. The number of variables required to describe a
physical system may be of the order of Avogadro's number, or $10^{23} \approx 2^{76}$. Moreover,
the number of states generated per temperature for each variable can be estimated by taking
the ratio of the time needed by the crystal to attain equilibrium at a fixed temperature
to the characteristic vibration period of the crystal lattice. This ratio may be
hours or days divided by pico-seconds, or roughly $10^{15} \approx 2^{50}$.

It follows that, in practice, the SA algorithm cannot perform a true simulation of the
physical process of annealing. Even with the fastest CPUs available, it is very difficult to
generate more than a few hundred states per variable per temperature. One reason why
physical crystals can go through so many states in a fairly short time (hours) is that the
crystal is a highly parallel analog computer, with $10^{23}$ or so atoms working together, and
the clock cycle is very fast (picoseconds).

Practical SA implementations can make up for the small number of moves by generating
them in a smart fashion, so as to rapidly go through a set of representative states. In
particular, it is pointless to generate moves that will systematically be refused. We want
to generate moves that result in a value of $\Delta E$ that is negative or not too large compared
to $T$. This ensures that the probability of accepting the new state, given by
$\min(1, e^{-\Delta E / T})$, is not minuscule. Using such techniques, SA provides a robust search
method that is capable of getting out of local minima by accepting uphill moves with a
probability given by the Boltzmann factor.

To handle a particular constraint in SA, we add a corresponding term to the objective
function $F$ to be minimized. This term $C$ is a penalty function, and takes on large positive
values when there exists a constraint violation. At high temperatures, SA will therefore
explore configurations that do not meet some of the constraints, since $\Delta E / T$ will be small
even if $\Delta E$ is large. This leads to a value of $e^{-\Delta E / T}$ that is large enough to accept the
state, even if there is a constraint violation. However, as $T$ decreases, the system will refuse
moves that increase $C$, and settle in a state that minimizes $F + C$, thereby avoiding
constraint violations. The other approach is to generate the moves so as not to cause
constraint violations.

We  now  discuss  how  we  have  used  SA  to  solve  the  PA  problem.  We  cover  the  choice 
of  the  energy  function,  the  move  generation,  and  the  annealing  control. 

2.2.  Energy  Function  and  State 

In our use of SA for PA, the state of the system is defined by the vector
$\vec{FT} = (F_1,..,F_N)$ of the fire times of each block. We take the penalty function approach
to ensure that the given constraints are met by the final solution.

2.2.1.  Basic  Terms 

The Energy function $F(\vec{FT})$ has three terms:

$$F(\vec{FT}) = SL + ST + PEN$$

SL     The slack cost, proportional to the sum of the squared signal slacks.
       The slack on a signal that connects the output of one block A to the
       input of another block B is the amount by which we could delay the
       output of A and still have the value ready before B fires.

ST     The stage length term, proportional to the sum of
       the squared length of each pipe stage.

PEN    The penalty function, to ensure that constraints are met. We have
       different penalty terms for each type of constraint; in general, the
       penalty function is equal to a constant plus the square of the
       amount by which the constraint is violated. For instance, for a
       non-overlap constraint, the cost would be a constant plus the square of
       the overlap time between the concerned blocks. If the corresponding
       constraint is met, the penalty is zero.

The terms are therefore calculated as follows:

$$SL = \text{sl\_scale} \cdot \sum (\text{slack})^2$$

$$ST = \text{st\_scale} \cdot \sum (\text{stage length})^2$$

$$PEN = \text{pen\_offset} + \text{pen\_scale} \cdot (\text{amount of violation})^2$$

(only if the corresponding constraint is broken).

Note: we have one PEN term for each (broken) constraint;
we sum all these terms into the cost function.

sl_scale, st_scale, pen_scale are scaling factors. pen_offset is chosen to ensure that all
constraint violations will be removed at T = 0.
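The three terms can be put together in a short Python sketch. It is an illustration under stated assumptions rather than the cost routine of the tool: only the non-overlap constraint type is modelled, the slack of an arc from block i to block j is taken to be $F_j - F_i - D_i$, and the argument names and default factor values are hypothetical.

```python
def energy(fire, delay, arcs, stages, no_overlap,
           sl_scale=1.0, st_scale=1.0, pen_scale=1.0, pen_offset=100.0):
    """fire, delay: block -> fire time / delay; arcs: (i, j) signal pairs;
    stages: (input latch, output latch) pairs; no_overlap: block pairs
    that must not be active at the same time."""
    # SL: scaled sum of squared signal slacks.
    sl = sl_scale * sum((fire[j] - fire[i] - delay[i]) ** 2 for i, j in arcs)
    # ST: scaled sum of squared stage lengths.
    st = st_scale * sum((fire[out] - fire[inp]) ** 2 for inp, out in stages)
    # PEN: one term per broken constraint, zero when the constraint is met.
    pen = 0.0
    for a, b in no_overlap:
        overlap = (min(fire[a] + delay[a], fire[b] + delay[b])
                   - max(fire[a], fire[b]))
        if overlap > 0:
            pen += pen_offset + pen_scale * overlap ** 2
    return sl + st + pen
```

Because every term is a sum over arcs, stages or constraints, the change in energy caused by moving a single block can in principle be computed from the terms that touch that block only.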


2.2.2.  Balancing  Cost  Function  Terms 

The  choice  of  the  offset  and  of  the  scale  factors  for  the  various  terms  in  the  cost  function 
heavily  influences  the  optimality  of  the  final  solution. 

If one of the energy terms is much larger than the others, then that term will
dominate the annealing process at high temperatures. For instance, if we have a very large
stage length term, the SA algorithm will concentrate on minimizing the stage length at
high $T$, while it will try minimizing SL and PEN only at lower temperatures. In effect, if
the cost function terms are unbalanced, the minimization objectives are separated according
to temperature. The algorithm will therefore not minimize all the criteria simultaneously;
this generally leads to sub-optimal solutions.

We tried a very large ST term, and found that, at high $T$, SA decided to crush the
stage lengths down, even at the expense of breaking certain constraints. As $T$ decreases, it
might turn out to be impossible to remove the constraint violations. In contrast, if the
terms are well balanced, SA will simultaneously optimize for all the criteria. This requires
slower cooling, but produces better results.

In order to choose well balanced factors, we reason on the physical system that SA simulates. Since the energy functions are of the form x^2, this system consists of a set of blocks interconnected by perfect springs. The scale factors correspond to the stiffness of the springs. Fig.20 shows the equivalent system. We see that the stage term springs connect very distant blocks, while the slack term springs tie together neighboring blocks. Similarly, the penalty function terms are short range, since they tie a block to the nearest slot that does not violate the constraint. We want the forces exerted by the springs to be reasonably balanced, so as to avoid crushing the system.

From this analogy, it is immediately clear that the penalty offset pen_offset should be larger than the amount by which the slack term could be reduced by allowing constraint violations:

$$pen\_offset > slack_{ij}(\text{constrained}) - slack_{ij}(\text{unconstrained})$$

for any single constraint violation. This ensures that those springs will not crush the blocks and make them break constraints.

FIG 20: PHYSICAL EQUIVALENT OF PA

NOTES:  Bi ARE THE BLOCKS
        Ki ARE THE SPRING STIFFNESSES
        Li ARE THE LATCHES
        STAGE = FROM LATCH (L1) TO LATCH (L2)
        K1 = STAGE SPRING
        K2 = LOCAL CONSTRAINT FOR BLOCK B2
        K3 = PRECEDENCE CONSTRAINT

A set of slack term springs in series is connected to a single stage term spring. Typically, there are as many slack springs in series as blocks per stage, blst. Since blst springs in series have a combined stiffness equal to 1/blst times that of a single spring, we need to weaken the stage term scale factor. This ensures that the stage term will not crush the blocks and cause precedence violations. Using this model and experimentation, we arrived at the following energy function factors. These factors do not account for blocks in parallel; this omission can be overcome by more conservative annealing parameters, which lead to longer run times. However, this was not a problem in our case since we used SA only to produce reference solutions.


Parm         Value      Why

sl_scale     1          Reference factor.

st_scale                Avoid crushing blocks.

pen_offset   (dynamic)  Should exceed the reduction in the slack and stage
                        cost functions that is achieved if we can gain one
                        clock cycle by breaking constraints.

pen_scale    √blst      The stiffness of these springs should be blst times
                        greater than that of the stage length term springs,
                        to avoid precedence violations.


2.3.  Move  Generation 

We describe three flavors of move generation, ranging from random to smart, heuristic move generation.

For random move generation, we select the blocks sequentially, and displace each block by generating a new, random fire time. The move generation is fast, but it produces useful moves less effectively than exchange moves do, as used in the PLA-folder Genie [Devadas 86]. PA is more similar to placement problems [Sechen 85] than it is to permutation problems such as the Traveling Salesman Problem [Kirkpatrick 83]. While annealing can rapidly solve the Traveling Salesman Problem with thousands of cities [Kirkpatrick 83], it takes much more CPU time to solve thousand-block placement problems.

The Range Limiter mechanism helps generate more useful moves. At low temperatures, only local moves that do not perturb the system too much will be accepted. We therefore generate new fire times that are within a specified range of the old ones. As the cooling progresses, we reduce the extent of this range. In our SA, the range is reduced so as to ensure that at least 5% to 10% of the generated moves are accepted.
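
A minimal sketch of such a range limiter is given below, in Python. It is not the OPD implementation: the function names, the shrink factor, the minimum range and the use of real-valued fire times are all assumptions chosen for illustration.

```python
import random


def propose_fire_time(old_time, move_range, t_min, t_max, rng=random):
    """Draw a new fire time within +/- move_range of the old one,
    clamped to the legal interval [t_min, t_max]."""
    lo = max(t_min, old_time - move_range)
    hi = min(t_max, old_time + move_range)
    return rng.uniform(lo, hi)


def adapt_range(move_range, accepted, generated,
                target=0.05, shrink=0.9, min_range=1.0):
    """Shrink the range when too few of the generated moves were accepted.

    target is the desired minimum acceptance ratio (5% here); shrink and
    min_range are hypothetical tuning knobs.
    """
    if generated and accepted / generated < target:
        return max(min_range, move_range * shrink)
    return move_range
```

The annealer would call adapt_range once per temperature, after counting how many of the moves generated at that temperature were accepted.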

We experimented with a move generator that never broke precedence constraints. However, we found that this led to sub-optimal solutions, or to no solution at all in some cases. SA tends to get stuck in illegal or sub-optimal solutions from which it cannot escape if precedence constraints are always satisfied.

We also tried another approach to respecting the precedence constraints: we re-schedule a large part of the network each time a (random) move that breaks precedence is generated. We found that SA had trouble gauging the effect of any one move. This is because every move is followed by a heuristic procedure that may re-organize large parts of the circuit before the next move is generated. We also experimented with a move generator that always satisfies a larger set of constraints (all local constraints). But the different constraints interact, and such an approach prevents SA from considering them all.

A better way might be to anneal only on the global (scheduling, non-overlap) constraints and use heuristic algorithms to re-assign phases each time one of the scheduling constraints is changed. However, since our main purpose in coding SA was to generate reference solutions to validate faster heuristic algorithms for PA, the long run times did not appear to be a problem.

2.4.  Annealing  Control 

Annealing  control  refers  to  the  inner  loop  criterion,  the  outer  loop  criterion  and  the  choice 
of  parameters. 

The inner loop criterion decides when a sufficient number of states have been generated at a given temperature. We have shown that it is not practical to generate as many states as in the physical annealing. In practice, we generate a fixed number NS(T) of states per block per temperature. NS(T) is on the order of 100 to 300. This implies that we do not test, and therefore do not wait, for thermal equilibrium. In fact, the curve in Fig. 21 shows clearly that we do not come anywhere close to equilibrium. The correct energy function, had we waited at each T, decreases monotonically as T is lowered. The fact that the energy can sometimes go up as T goes down shows that the Monte-Carlo Metropolis simulation is not exact. This scheme is similar to that used by the placement package TimberWolf [Sechen 85] and the PLA-folder Genie [Devadas 86].


FIG 21: ENERGY (COST) VS. TEMPERATURE, ANNEALING - FPU RUN

We ran an experiment in which we waited for the average energy to stabilize at each temperature before cooling. This is somewhat analogous to achieving thermal equilibrium. We found that NS(T) increased roughly by a factor of 10. E(T) also decreased monotonically as T was lowered. However, the quality of the results did not improve enough to justify a ten-fold increase in computing time; we therefore use the non-equilibrium scheme.

In order to improve the quality of the final solution, we increase NS(T) as T decreases. It is sufficient to generate a small number of states at high temperatures since the system is in a random configuration anyway and since SA is only performing an approximate exploration. However, as the system is cooled, it becomes useful to generate more states in order to evaluate the energy with greater precision. In our implementation, we divide NS(T) by the cooling factor a when the fraction of accepted moves drops suddenly; since a is less than 1, this increases NS(T). Such a situation occurs when the energy is rapidly changing, corresponding to a phase transition in the system. It is then desirable to generate more states to observe this transition better.

The  outer  loop  criterion  is  simple.  If  the  energy  has  not  changed  appreciably  over  the 
last  three  temperatures,  we  stop  the  annealing  process.  This  criterion  has  proven  useful  in 
TimberWolf  [Sechen  85]. 
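
The inner and outer loop criteria can be summarized by the following Python sketch. It is a generic Metropolis annealing loop written under stated assumptions, not the OPD code: the callback signatures energy(state) and propose(state, rng), the tolerance tol and the default values are hypothetical, and the adjustment of NS(T) during cooling is omitted for brevity.

```python
import math
import random


def anneal(state, energy, propose, t_start, alpha=0.9,
           ns_per_block=150, n_blocks=1, tol=1e-6, rng=None):
    """Anneal with a fixed number of moves per temperature.

    Inner loop: ns_per_block * n_blocks generated states, with no test for
    thermal equilibrium.  Outer loop: stop once the energy has not changed
    appreciably (by more than tol) over the last three temperatures.
    """
    rng = rng or random.Random(0)
    t = t_start
    e = energy(state)
    history = []                      # energy at the end of each temperature
    while True:
        for _ in range(ns_per_block * n_blocks):
            candidate = propose(state, rng)
            e_new = energy(candidate)
            delta = e_new - e
            # Metropolis rule: always accept downhill moves, and accept
            # uphill moves with probability exp(-delta / T).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                state, e = candidate, e_new
        history.append(e)
        if len(history) >= 3 and max(history[-3:]) - min(history[-3:]) <= tol:
            return state, e
        t *= alpha                    # geometric cooling
```

For example, anneal(10, lambda x: (x - 3) ** 2, lambda x, rng: x + rng.choice((-1, 1)), t_start=0.01) starts so cold that the loop behaves like a descent and settles on x = 3.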

In practice, we found that NS(T) should be greater than 150 generated states (accepted or refused) per block per temperature. Also, the offset term pen_offset for the penalty function should be just large enough, but not more. The best results are obtained with a ≥ 0.90 to ensure slow cooling. We also found that both NS(T) and T have to be much higher if the problem involves global constraints. Other annealing-based programs also have trouble with global constraints; for instance, placement packages using SA are not well-suited to regular, highly-constrained datapath placement. For such problems, the choice of annealing parameters is very critical. For the SPUR FPU and an example derived from the RISC-II [Katevenis 83] [Sherburne 84], run times are about 0.5 to 1.5 hours of VAX8800 CPU. Our main aim in coding SA is to verify the solutions produced by the heuristic algorithms described next. In both cases, SA did not find a significantly better result than the heuristic algorithms.

Fig.22a and Fig.22b show how the move generation range and the number of moves generated vary with temperature. There is a noticeable phase transition around T = 105, as evidenced by a rapid decrease in the range of those moves that the annealing generates. The program adjusts this range in an attempt to ensure that at least 5% to 10% of the moves generated are accepted. The rapid decrease of this range takes place because the program is responding to a rapid decrease of the system's randomness. The system is suddenly "solidifying".

2.5.  Coding  Issues 

In order to ensure reasonable run times, we calculate the energy functions in an incremental fashion. Every time a move is generated, we re-calculate only those terms that change in the energy function. To achieve this, we use data structures that immediately tell us which constraints and which terms depend on a particular block's fire time. Also, much of the annealing code is written inline; a compiler that could automatically expand functions inline would be very helpful here. It is worth avoiding function calls in the critical loop since this loop may be executed several million times.

In order to debug the annealing code, we compare the energy as calculated incrementally with the energy re-calculated directly at every accepted move. We found that the direct calculation of the energy increases compute time by about a factor of 10. Overall, our annealing code was more difficult to debug than usual because of the need for very high speed, its exotic data structures, and a rather long critical loop.
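
The idea of incremental evaluation, together with the debugging cross-check, can be sketched as follows in Python. This is an illustration under simplifying assumptions: the class name IncrementalEnergy is hypothetical, and the energy is reduced to a sum of squared slacks over (source, destination, delay) edges, with no stage or penalty terms.

```python
from collections import defaultdict


class IncrementalEnergy:
    """Energy = sum of squared slacks over edges (src, dst, delay)."""

    def __init__(self, fire_times, edges):
        self.t = dict(fire_times)
        self.edges = list(edges)
        # Index telling us at once which terms depend on a block's fire time.
        self.touching = defaultdict(list)
        for idx, (a, b, _) in enumerate(self.edges):
            self.touching[a].append(idx)
            if b != a:
                self.touching[b].append(idx)
        self.energy = self.full_energy()

    def _term(self, idx):
        a, b, delay = self.edges[idx]
        slack = self.t[b] - (self.t[a] + delay)
        return slack * slack

    def full_energy(self):
        """Direct (slow) calculation over every term."""
        return sum(self._term(i) for i in range(len(self.edges)))

    def delta(self, block, new_time):
        """Energy change if block moved to new_time; only affected terms."""
        affected = self.touching[block]
        old = sum(self._term(i) for i in affected)
        saved, self.t[block] = self.t[block], new_time
        new = sum(self._term(i) for i in affected)
        self.t[block] = saved
        return new - old

    def apply(self, block, new_time, debug=False):
        """Accept a move; in debug mode, cross-check against full_energy()."""
        self.energy += self.delta(block, new_time)
        self.t[block] = new_time
        if debug:
            assert abs(self.energy - self.full_energy()) < 1e-9
```

Running with debug=True reproduces the check described above at every accepted move, at the price of a full recalculation each time.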

3.  Heuristic  Algorithms  for  Phase  Assignment 

We describe heuristic algorithms that provide good solutions rapidly, within seconds for typical problems. Such algorithms make it possible to use OPD interactively. These algorithms are built upon GPA, a Greedy Phase Assignment algorithm.


FIG 22(a), 22(b): RANGE OF ANNEALING (MAX RANDOM DISPLACEMENT OF A BLOCK'S FIRE TIME) AND NUMBER OF MOVES GENERATED VERSUS TEMPERATURE

3.1. GPA: Greedy Phase Assignment

The basic GPA performs either a breadth-first or a depth-first trace of the network. Each block is assigned the earliest fire time that meets all the constraints involving it. Constraints that refer to blocks that are not yet scheduled are ignored. GPA using depth-first search guarantees a solution if one exists; this is because any constraint can be met by delaying a set of blocks sufficiently.
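
The greedy rule can be illustrated by the short Python sketch below. It is a simplified stand-in for GPA, not the OPD code: blocks are assumed to arrive already in topological order, fire times are integers, and all constraints other than precedence are folded into a hypothetical predicate extra_ok(block, time, fire) that only sees blocks scheduled so far.

```python
def gpa(blocks, preds, delay, extra_ok=None, max_time=10_000):
    """Assign each block the earliest fire time meeting its constraints.

    blocks   -- block names in topological order
    preds    -- dict: block -> list of predecessor blocks
    delay    -- dict: block -> execution time of the block
    extra_ok -- optional predicate for non-precedence constraints; it is
                given the dict of already scheduled blocks, so constraints
                on unscheduled blocks are naturally ignored
    """
    fire = {}
    for b in blocks:
        # Earliest time at which every predecessor has finished.
        t = max((fire[p] + delay[p] for p in preds.get(b, ())), default=0)
        # Delay the block until the remaining constraints are met.
        while extra_ok is not None and not extra_ok(b, t, fire):
            t += 1
            if t > max_time:
                raise ValueError(f"no fire time found for block {b!r}")
        fire[b] = t
    return fire
```

For a chain A (2 units) feeding B (3 units), with C depending on both, the result is A at 0, B at 2 and C at 5; a constraint forcing C onto even times would push it to 6.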

3.2.  HA:  Heuristic  Algorithm 

There are two reasons why GPA often produces sub-optimal solutions. First, GPA may schedule a block to run too early, which can force other blocks to be delayed in order to meet the constraints. This leaves dead slack in the network and decreases throughput. Second, a global constraint, such as one specifying that two stages A and B must not overlap in time, can have two valid solutions: either all the blocks in A run before any block of B, or the reverse. GPA, by its nature, does not consider any such exchange moves between independent blocks.

Fig. 23 outlines HA. HA visits each block B in the network level by level, using depth-first ordering. For each B, HA tries NTRIALS consecutive possible fire times. For each possible fire time, HA calls GPA to re-schedule the part of the network that topologically follows the block. The fire time for B that produced the best schedule is selected.

HA  therefore  re-schedules,  on  the  average,  half  the  network  for  each  block.  The  run 
time  grows  as  the  square  of  the  size  of  the  network.  HA  is  capable  of  delaying  a  block’s  fire 
time  to  find  a  better  schedule,  and  can  consider  the  interchange  of  independent  blocks  by 
delaying  one  of  them  by  a  sufficient  number  of  phases.  HA  attempts  to  pack  stages  that 
should  not  overlap  in  as  compact  a  fashion  as  possible;  however,  should  this  packing  fail, 
HA  switches  to  an  algorithm  that  guarantees  a  solution  but  might  decrease  throughput. 
This  algorithm  delays  all  the  blocks  of  one  of  the  stages  to  ensure  non-overlap. 

The mechanism that HA uses to maintain a global view of the network is similar to the Zone Refining (ZR) process, as applied to compaction problems [Shin 86]. Unlike ZR, HA has to re-process the whole part of the network that follows the point of change every time a zone gets changed, since constraints can be global. Run times are usually of the order of a few seconds.

Figure 23: Heuristic Phase Assignment Algorithm HA

{ Picks best fire time for a given block B. Tries NTRIALS different times }
procedure ONE-HA (integer NTRIALS, block B)
    let T = earliest fire time for B consistent with constraints;
    for I from 1 to NTRIALS do
        unschedule all blocks topologically following B;

        { Try scheduling B after T, by the Greedy scheduler }
        { This move checks if a better solution results by delaying B }
        call GPA (block = B, min-fire-time = T);

        { re-schedule that part of the network which topologically follows B }
        schedule all other blocks on the same DFS level as B
            and following B by calling GPA on them;

        if this is a solution that meets all constraints, record it;

        let T = next sensible fire time to try for B;
    end for;

    Pick the best found solution: this gives the fire time for B;
end ONE-HA;

{ Note: the run time of ONE-HA is proportional to Network-size * NTRIALS }

{ Finds a good assignment for the whole network }
procedure HA (integer NTRIALS)
    for each block B of the network in topological DFS order,
        call ONE-HA (NTRIALS, B);

    { Note: this progressively fixes the fire time of each block.
      The network is in effect "zone-refined". }

    If no solution found, switch to a guaranteed algorithm;
end HA;

{ Note: the run time of HA is proportional to (Network-size)^2 * NTRIALS }

Fig.24 shows how the quality of the solution improves with NTRIALS.

3.3.  Other  Heuristics  Considered 

In order to handle global scheduling constraints, we considered using urgency-scheduling [Park 85]. This method orders blocks according to their distance to the end of the network (forward-urgency) or from the start of the network (backward-urgency). The distance of a block to the end of the network is the length of the longest path from that block to a terminal block.

Urgency-scheduling consists of scheduling the blocks with the highest urgency first. This method was not chosen because the local constraints interact with the global ones, and urgency-scheduling has no mechanism to explore these interactions. [Park 85] describes applications of urgency-scheduling to synthesis problems that do not involve as many constraints as ours.
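
For reference, the forward-urgency ordering can be computed as in the Python sketch below. The function name and data layout are hypothetical, and we assume here that the length of a path includes the delay of the block it starts from; [Park 85] may define it differently.

```python
from functools import lru_cache


def forward_urgency(succs, delay):
    """Longest-path distance from each block to a terminal block.

    succs -- dict: block -> list of successor blocks (acyclic network)
    delay -- dict: block -> execution time of the block
    """
    @lru_cache(maxsize=None)
    def dist(block):
        # A terminal block has no successors: its distance is its own delay.
        return delay[block] + max((dist(s) for s in succs.get(block, ())),
                                  default=0)

    return {b: dist(b) for b in delay}


def urgency_order(succs, delay):
    """Blocks sorted with the highest forward-urgency first."""
    urgency = forward_urgency(succs, delay)
    return sorted(delay, key=lambda b: -urgency[b])
```

The ordering depends only on path lengths, which is why it cannot account for the interaction between local and global constraints noted above.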

Another  approach,  used  by  A.  Parker  in  the  datapath  synthesizer  MAHA  [Parker  85], 
is  to  schedule  the  blocks  according  to  their  freedom  [Nagle  81].  Loosely,  the  freedom  of  a 
block  is  the  difference  between  the  earliest  time  at  which  it  can  be  scheduled  and  the 
latest.  Freedom-based  scheduling  was  not  chosen  for  similar  reasons. 

Local operations to "shake" the network appear insufficient, since these cannot explore global interchanges while respecting the constraints.


FIG 24: QUALITY OF THE SOLUTION VERSUS NTRIALS

4.  Constraint Set Completeness

The constraint set is designed to capture most of the common hardware and scheduling constraints, so as to adapt to any technology and any design style.

•  Shared Resources: busses, register files, ALUs. Fig. 25 shows an example. Shared resources are modeled as follows. Let RESC be a shared resource. We create blocks RESC_1, ..., RESC_n, one for every use of this resource. We introduce an extra constraint to force all these instances of a use of RESC to fire on the same phase, but at non-overlapping times. This constraint reflects the fact that two instances of RESC should not occur at the same time, but will nevertheless occur on the same phase (in different cycles) since they share the same hardware clock line. The constraints can also be used to describe datapaths with fixed loops. These loops correspond to a block being re-used. Such loops can be unrolled.

•  Precharge schemes: fixed-phase precharge (the same phase is used for precharging all the gates), per-gate variable precharge phase, and intermediate schemes can be described. Chained dynamic and domino or NORA gates can be handled too [Weste 85]. To handle chained dynamic gates, we create constraints to specify that a gate should not run on the same phases as a gate that it fans out to. To handle precharging, we first create a "precharge" block for each precharged gate. This block fires when the associated gate is being precharged. We then specify that a gate G and its "precharge" block P should not run on the same phase. This reflects the fact that a gate should not be precharged while it is being used. Moreover, if all the gates should be precharged on a common precharge phase, we simply specify that all these "precharge" blocks should run off a common (and perhaps fixed) phase.

•  Delays that depend on the input/output pair: such gates can be handled by adding dummy "delay" blocks.

The output from PA therefore specifies exactly when each gate and each latch will be clocked; this output precisely defines the length of each pipeline stage. PA can generate a reservation table in the file format required by SCHED. SCHED can then calculate an optimal initiation sequence [Kogge 81] for the pipeline.

FIG 25: MODELLING SHARED RESOURCES FOR PA

NOTES: L1..L4 ARE LATCHES. THE SHARED RESOURCE RESC IS SHARED BY STAGES (L1,L2) AND (L3,L4); THE SHARING RESULTS IN A DATAPATH LOOP. A CORRECT WAY TO MODEL RESOURCE SHARING FOR PA IS TO UNROLL THE DATAPATH LOOP: RESC IS DUPLICATED, AND EACH COPY CORRESPONDS TO ONE USE OF THE ORIGINAL RESC. CONSTRAINT: RESC#1 AND RESC#2 MUST RUN ON THE SAME CLOCK PHASE, BUT DURING DIFFERENT CYCLES.

Section IV - Reservation Table Scheduling

This  section  describes  a  scheduling  procedure  from  [Kogge  81]  that  determines  an  optimal 
initiation  sequence  for  a  pipeline.  The  throughput  of  a  pipe  does  not  depend  exclusively  on 
the  length  of  the  longest  stage;  it  depends  on  the  exact  pattern  of  stage  lengths,  that  is, 
how  many  clock  cycles  each  stage  requires.  After  PA,  we  know  this  exact  pattern.  We  can 
therefore  now  determine  exactly  when  new  data  should  be  allowed  to  enter  the  pipe  so  as  to 
maximize  the  average  processing  rate. 

It can be quite difficult to find a good initiation schedule by hand. This is because the pattern of stage lengths found by PA can be irregular, with stages requiring different numbers of cycles to complete. With such complex patterns, it is not easy to manually determine when new data should be allowed to enter the pipe in order to maximize the processing throughput. PA tells us how long each stage is; however, PA does not produce the optimal new data entry times.

1.  Scheduler Overview

The  scheduler  takes  a  static  reservation  table  as  input.  It  calculates  an  initiation  sequence 
that  maximizes  the  throughput.  The  optimal  initiation  sequence  can  turn  out  to  be  fairly 
irregular;  it  need  not  be  cyclic  with  respect  to  the  compute  time  of  the  reservation  table. 

1.1.  Static  Reservation  Table 

The initiation sequence specifies when new data should enter the pipe. The input to the procedure is a reservation table, which specifies which pipeline stages are used by a particular computation, for how long and in what sequence. Two examples from [Kogge 81] are shown in Fig. 26.

Figure 26: Sample reservation tables

1.2.  Reservation  Table  Generation 

An  exact  reservation  table,  that  takes  into  account  all  the  hardware  constraints,  can 
be  generated  by  PA  after  the  Phase  Assignment  step  has  been  done.  Given  this  reservation 
table  from  PA,  the  scheduler  will  try  to  avoid  collisions  in  order  to  maximize  the  pipe 
throughput.  The  performance  is  based  on  the  MAII  defined  in  the  next  section. 

1.3.  Latency 

A key parameter in determining the performance of a pipeline is the initiation interval, or latency: the number of time units separating two initiations. An initiation takes place when a new datum enters the pipe. Since the goal of a pipe is to maximize throughput, the prime measure of actual system performance is the initiation rate, or average number of initiations per clock unit. The average initiation interval is the reciprocal of the initiation rate, that is, the average number of time units between two initiations. We are looking for the MAII, the minimum average initiation interval.

Fig. 27: Sample pipeline
(a) and (b) for reservation table B
(c) and (d) for reservation table A

2.  Scheduling Algorithm

2.1.  Strategy

A greedy strategy is not optimal since, if a datum is allowed to enter the pipe too early, this may block further initiations and actually increase the average initiation interval. The algorithm we use is exhaustive and finds the best solution. The algorithm also runs fast by using various techniques to effectively prune the search tree.

A  bound  on  MAII  can  be  determined  from  the  reservation  table  alone  according  to  the 
following  lemma: 

2.1.1.  Lemma  (Shar  1972) 

For any statically configured pipeline executing some reservation table, the MAII is always greater than or equal to the maximum number of marks in any single row of the reservation table. On the other hand, the MAII is bounded above by the number of 1's in the initial collision vector, defined in the next section.

Let the number of marks in the ith row of the reservation table be N(i). Then

$$\max_i N(i) \le MAII \le \text{number of 1's in the initial collision vector}$$

In Fig. 26, reservation table A has 3 marks in rows 2 and 3, and consequently its MAII is at least 3. Likewise the MAII of reservation table B is bounded below by 4. This lower bound gives the scheduler a quick estimate of the maximum performance possible for a given reservation table. Also, any schedule giving an average initiation interval larger than the upper bound can be discarded. However, there is no guarantee that the actual MAII equals the lower bound. The reason is that small latencies may cause collisions and therefore cannot be used in an initiation sequence.
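
Both bounds are cheap to evaluate. The Python sketch below does so for a reservation table stored as a list of rows of 0/1 marks; this representation and the function name maii_bounds are assumptions made for the example, and the small table in the usage note is hypothetical (it is not one of the tables of Fig. 26).

```python
def maii_bounds(table):
    """Return (lower, upper) bounds on the MAII of a reservation table.

    table -- list of rows (one per stage); table[i][j] is 1 if stage i is
             used at time j, else 0.  All rows have the same length d.
    """
    # Lower bound: the largest number of marks in any single row.
    lower = max(sum(row) for row in table)

    # Upper bound: the number of 1's in the initial collision vector, i.e.
    # the number of distinct forbidden latencies, latency 0 included.
    forbidden = {0}
    for row in table:
        times = [j for j, mark in enumerate(row) if mark]
        forbidden.update(b - a for a in times for b in times if b > a)
    upper = len(forbidden)
    return lower, upper
```

For the three-stage table [[1,0,0,0,0,1], [0,1,0,1,0,0], [0,0,1,0,1,0]], each row has two marks and the forbidden latencies are 0, 2 and 5, so the bounds are 2 and 3.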

3.  Data  Structures 

3.1.  State  Diagram 

The state diagram [Kogge 81] is a technique to rapidly determine if, at a given time, a new initiation in the pipeline will conflict with any previous initiations. At each time unit, the current pipeline configuration corresponds to one of the states. The arcs from one state to the next indicate what new state the pipeline might be in at the next time unit. All possible initiation sequences correspond to paths in such diagrams. By analyzing all such paths, particularly the ones that form closed loops, those with minimum average initiation interval can be identified. Fig.28 shows a state diagram. Each box represents a state and contains a collision vector. The collision vector, described next, shows in compact form when new initiations can be made without resource conflicts from that pipeline state. The arcs in Fig.28 show how the pipe can change states by making initiations. Each arc is labeled with the number of clock cycles required by the pipe to make a transition from the arc's source state to its sink state.

3.2.  Collision Vector

The particular information encoded into each state is termed a collision vector. This vector is a d-bit binary sequence, where d is the compute time of the reservation table. The d bits are labeled 0 to d-1 from left to right, with a 0 in position i indicating that a new initiation i time units from now will not conflict with any currently uncompleted initiations. A 1 indicates that a collision will occur, and therefore an initiation at that time must be avoided. The collision vector for each time period takes into account whether or not a new initiation was made in that period.

The collision vector for the initial state has a special name: the initial collision vector. Since it corresponds to the time unit when the pipeline is first started, it is a representation of what latencies are permissible between just two initiations, one at time 0 and one at time i.

3.3.  Algorithm  For  Finding  The  Initial  Collision  Vector 


3.3.1.  Conceptual

for i = 0 to d - 1 do {
    make one copy of the reservation table;
    shift the copy to the right i times;
    overlay an unshifted copy of the same table;
    if there are two marks in the same stage-time entry anywhere
        bit i of the initial collision vector = 1;
    else
        bit i of the initial collision vector = 0;
}

In  all  cases  bit  0  of  this  collision  vector  is  1  because  overlaying  a  reservation  table  on 
itself  causes  a  collision  everywhere  there  is  a  mark.  Likewise  bit  positions  d  and  beyond 
are  always  0  because  the  shifted  and  nonshifted  tables  never  overlap. 


3.3.2.  Actual  Implementation 

An alternative approach is to construct the forbidden initiation interval set. The number i is a member of this set when, in at least one row of the reservation table, there are two marks separated by i columns. Analysis of the marks in each row quickly identifies the members of this set. We notice that 0 is always in the set. Finally, for all i in the set the corresponding bit of the initial collision vector is 1. All other bits are 0.

insert_into_forbidden_set(0);                  /* 0 is always forbidden */
for i = 1 to number_of_stages do               /* for each stage */
    for j = 1 to compute_time do               /* for each time */
        if (table[i][j] is marked)
            for k = j + 1 to compute_time do   /* find the forbidden set */
                if (table[i][k] is marked)
                    insert_into_forbidden_set(k - j);
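
The same computation can be written as a short runnable sketch in Python. The list-of-rows table layout, the tuple representation of the collision vector (bit 0 first) and the function names are assumptions made for this example.

```python
def forbidden_set(table):
    """Latencies i such that some row has two marks i columns apart."""
    forbidden = {0}                       # latency 0 always collides
    for row in table:
        times = [j for j, mark in enumerate(row) if mark]
        for a in range(len(times)):
            for b in range(a + 1, len(times)):
                forbidden.add(times[b] - times[a])
    return forbidden


def initial_collision_vector(table):
    """Tuple of d bits; bit i is 1 if latency i is forbidden."""
    d = len(table[0])                     # compute time of the table
    forbidden = forbidden_set(table)
    return tuple(1 if i in forbidden else 0 for i in range(d))
```

For the hypothetical table [[1,0,0,0,0,1], [0,1,0,1,0,0], [0,0,1,0,1,0]], the forbidden set is {0, 2, 5} and the initial collision vector is 101001.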

3.4.  State  Diagram 

Once  the  initial  state  of  the  pipeline  at  time  0  is  available,  the  equivalent  states  for 
all  future  times  may  be  computed. 

The state diagram shows the relationship between states at consecutive times. Because the problem itself is NP-complete, the number of states generated may be large. Since the states where no initiation occurred carry no information, they are deleted to produce the modified state diagram.

3.5.  Modified State Diagram

Such  a  diagram  is  similar  to  the  original  diagram  but  includes  only  those  states 
resulting  from  new  initiations.  Two  states  in  this  new  diagram  are  connected  by  an  arc  if 
and  only  if  they  were  connected  by  some  series  of  arcs  in  the  original  diagram.  The 
number  attached  to  an  arc  is  the  interval  between  two  initiations.  Figure  28a  lists  the 
modified  diagram  for  reservation  table  B. 


3.5.1.  Algorithm  For  Generating  The  Modified  State  Diagram. 


Put initial state with initial collision vector in unprocessed list;
while (Get_From_Unprocessed_List(current_state) != EMPTY) do {
    for each k such that the kth bit of the collision vector is 0 {
        New_State = Left_Shift(current_state.collision_vector) k times;
        New_State = New_State OR Initial collision vector;

        if New_State already exists
            Insert arc from current_state to existing state with value k;
        else if New_State == current_state
            Insert arc to current_state itself with value k;
        else {
            Connect New_State to current_state with an arc of value k;
            Enter New_State to unprocessed list; }
    }
    Include an arc with value d from each state back to the initial state;
    Insert current_state into processed_list; }

Fig. 28 Modified State Diagram: (a) reservation table B; (b) reservation table A
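
A minimal Python sketch of this generation loop is given below. It assumes that collision vectors are bit strings with bit 0 leftmost and that d is the length of the vector; the function name and the dictionary representation of the diagram are choices made for this illustration, not part of OPD.

```python
from collections import deque


def modified_state_diagram(icv):
    """Build the modified state diagram from an initial collision vector.

    `icv` is a bit string whose leftmost character is bit 0.
    Returns {state: {interval: next_state}}.
    """
    d = len(icv)
    icv_bits = int(icv, 2)
    mask = (1 << d) - 1

    def successor(state, k):
        # Shift left k times (bits falling off the left end are dropped),
        # then OR with the initial collision vector.
        shifted = (int(state, 2) << k) & mask
        return format(shifted | icv_bits, "0{}b".format(d))

    diagram = {}
    unprocessed = deque([icv])
    while unprocessed:
        state = unprocessed.popleft()
        if state in diagram:
            continue  # already processed
        arcs = {}
        for k in range(1, d):
            if state[k] == "0":  # interval k is permitted in this state
                arcs[k] = successor(state, k)
        arcs[d] = icv  # arc with value d back to the initial state
        diagram[state] = arcs
        unprocessed.extend(s for s in arcs.values() if s not in diagram)
    return diagram


# Small hypothetical example: forbidden set {0, 2}, compute time 4.
for state, arcs in modified_state_diagram("1010").items():
    print(state, arcs)
```

For the vector 1010 the sketch produces two states, 1010 and 1110; the arc of value 1 leads from the first to the second, and every other arc returns to the initial state.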

3.6.  Efficient  Search 

An  initiation  schedule  is  a  cyclic  sequence  of  initiations.  Each  initiation  corresponds 
to  an  arc  in  the  modified  state  diagram;  an  initiation  sequence  corresponds  to  a  cycle  in  the 
state  diagram.  The  initiation  sequence  must  contain  the  initial  state.  An  optimal  schedule 
is  an  initiation  sequence  such  that  the  average  time  separating  consecutive  initiations  in 
the  cycle  is  minimal. 

Simple cycles, in which each state appears no more than once, are an important class.
Figure 28b shows the modified state diagram for reservation table A. The initiation interval
cycle (3,7,5,7) is not simple while (3,5,7) is. The utility of simple cycles comes from the
following lemma.

3.6.1.  Lemma  2  (Shar  1972) 

In any modified state diagram if there is a cycle with an average initiation interval L,
there is at least one simple cycle with average initiation interval no greater than L.

3.6.2. Reducing the Search Time

The  above  lemma  allows  the  scheduling  algorithm  to  limit  its  search  for  optimal 
cycles  to  simple  cycles  because  it  guarantees  that  no  nonsimple  cycle  can  have  a  lower 
average  initiation  interval  than  one  of  the  simple  ones.  This  makes  an  exhaustive  search 
feasible. 
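
The exhaustive search over simple cycles can be sketched as below. This is an illustration under stated assumptions rather than the scheduler's implementation: the diagram is assumed to be a dictionary {state: {interval: next_state}}, the function name is hypothetical, and the sketch enumerates every simple cycle rather than restricting itself to cycles through one particular state.

```python
def best_simple_cycle(diagram):
    """Exhaustively search the simple cycles for the lowest average interval.

    Returns (average_interval, list_of_intervals).
    """
    best = (float("inf"), [])

    def extend(start, state, visited, intervals):
        nonlocal best
        for k, nxt in diagram[state].items():
            if nxt == start:
                cycle = intervals + [k]
                avg = sum(cycle) / len(cycle)
                if avg < best[0]:
                    best = (avg, cycle)
            elif nxt not in visited and nxt > start:
                # Only visit states larger than `start`, so that each simple
                # cycle is enumerated once, from its smallest state.
                extend(start, nxt, visited | {nxt}, intervals + [k])

    for start in sorted(diagram):
        extend(start, start, {start}, [])
    return best


# Hypothetical two-state diagram (forbidden set {0, 2}, compute time 4).
diagram = {
    "1010": {1: "1110", 3: "1010", 4: "1010"},
    "1110": {3: "1010", 4: "1010"},
}
print(best_simple_cycle(diagram))  # (2.0, [1, 3])
```

Because a state is never revisited within one cycle, the depth of the search is bounded by the number of states, which is what makes the enumeration finite.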

4.  Examples  and  Results 

4.1.  Simple  Pipeline  Machines 

The scheduler ran in a fraction of a second on most small examples (fewer than 10
stages and fewer than 10 units of compute time). Most of the time is spent searching the
modified state diagram. The complexity of the state diagram depends on the number of
possible new initiations, which is directly proportional to

(Compute Time) - (Number of Elements in the Forbidden Set).

Therefore the less dense a reservation table is, the more complicated the modified state
diagram is going to be.

Since the problem is NP-complete, we may run into trouble with a complicated reservation
table that generates many states. In practice, this seldom happens because new
data is typically introduced frequently into the pipe and instructions usually complete
within 15 to 20 cycles. If an instruction requires more than 10 cycles, it is usually an
iterative finite state machine, which implies that the reservation table is very dense. Also,
there are usually fewer than 15 stages. Therefore in real applications, the above problem
seldom occurs.

4.2.  More  Complicated  Machines 

An artificial reservation table with a compute time of 15 cycles and 5 pipeline stages,
arranged in a worst-case pattern, is used to test the scheduler.

% time scheduler -t 15 -s 5 < T5
The reservation table being optimized:

C0000C0000C0000
0C0000C0000C00C
00C0000C0000C00
000C0000C0000C0
0000C0000C0000C

The forbidden initiation interval set contains time slot:
0 3 5 8 10 13

The range of MAII is : 4 <= MAII <= 6

The optimal sequence is as follows :
( 2 4 12 )

The order of states are:
110101001010010
110101101011010
111111111110010
with a MAII of 6.

23.9u 1.5s 0:46 55% 17+4102k 2+1io 0pf+0w  - VAX 8800 times

This example causes 215 states to be generated and corresponds to a worst-case pattern.

It is very unlikely to find a hardware configuration that would result in such a table.
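
The reported sequence can be checked by replaying it against the collision vectors. The sketch below is an illustration only; it assumes bit strings with bit 0 leftmost, an initial collision vector built from the forbidden set listed above, and a hypothetical helper name `replay`.

```python
def replay(icv, state, intervals):
    """Follow a sequence of initiation intervals starting from `state`.

    Returns the list of states reached; raises ValueError on a collision.
    """
    d = len(icv)
    visited = []
    for k in intervals:
        if k < d and state[k] == "1":
            raise ValueError("interval {} collides in state {}".format(k, state))
        shifted = state[k:] + "0" * min(k, d)  # left shift by k positions
        state = "".join("1" if a == "1" or b == "1" else "0"
                        for a, b in zip(shifted, icv))
        visited.append(state)
    return visited


icv = "100101001010010"    # forbidden set {0, 3, 5, 8, 10, 13}
start = "110101001010010"  # first state listed by the scheduler
states = replay(icv, start, [2, 4, 12])
print(states)
print(sum([2, 4, 12]) / 3)  # 6.0
```

Replaying intervals 2, 4 and 12 from the first listed state visits the second and third listed states and returns to the first, so the three states form a cycle whose average interval is 18/3 = 6, in agreement with the reported MAII.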


5.  Control  Synthesis  Examples 

Appendix 4 shows two control synthesis examples, using time-stationary and
data-stationary pipeline control schemes [Kogge 81].


Conclusion 


This report describes OPD, a set of co-ordinated tools whose objective is to help designers
produce better pipeline structures in a shorter time. OPD spans a wide range of design
levels, starting at the behavioral level. Because OPD has the ability to capture most
technology-specific constraints, it remains useful right down to the final datapath and
scheduling design steps.

Good pipeline design requires the solution of many complex inter-related optimization
problems. We have developed, coded and evaluated various heuristic and probabilistic
algorithms for solving these NP-complete optimization problems within OPD.
Interdependencies are handled during one optimization step by using a simplified model of
the related optimization steps.

We also illustrate how OPD works in conjunction with existing CAD tools and with
the designer. The Design Methodology and the Optimization routines were verified using
several small and large examples.

Appendix 1 - Input Format for SP

We describe the ASCII input and output format for the stage-partitioning tool SP. We
show the files for a small graph shown in Fig.3 and for the HP21-MX CPU taken from
[Park 85] shown in Fig.8. The main purpose of this ASCII format is to facilitate the
debugging of SP. The format is therefore easy to parse; it is not meant to serve as a refined
input language.

1.  Input  Format  for  SP 

The input to SP consists of LISP expressions of the form (keyword arguments ...). Each
statement is implemented as a LISP function; the arguments can therefore be any LISP
expression. There are five possible statements, described below.

• The nodes statement

This defines the set of operations (nodes) of the dataflow "traces" that the system can
execute. The format is:

(nodes
    '(name-1 delay-1 [resource-type-1])
    '(name-2 delay-2 [resource-type-2])

)

The quotes protect the arguments from evaluation by the LISP interpreter. The
resource-type is optional. This statement declares the name of each node and associates
a delay and an [optional] resource-type with each node. The resource-type is used to
describe resource sharing; the number of nodes with a given resource-type that are
simultaneously active must not exceed the number of resources of that type available to the
system (this number is specified by the resource-limits statement, below).

• The resource-limits statement

This sets a limit on the number of resources of a particular type available to the system.
The format is:

(resource-limits
    '(resource-type-1 limit-1)

)

where limit-i is the number of units of type resource-type-i available to the system.

• The paths statement

This serves to describe one dataflow "trace". The format is:

(paths
    'trace-name
    trace-probability-of-occurrence
    '(dst-1 sce-1-1 sce-1-2 ... sce-1-N1)
    '(dst-2 sce-2-1 ... sce-2-N2)

)

This statement means that there are arcs from sce-1-1 to dst-1, from sce-1-2 to dst-1, and
so on. In short, the fanin arcs of dst-i come from sce-i-1, sce-i-2, through sce-i-Ni.
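
As an illustration of how the fanin lists expand into arcs, here is a small Python sketch. It is not part of SP; the function name is hypothetical, and the trace used resembles the Fig.3 example given later in this appendix.

```python
def fanin_arcs(path_lists):
    """Expand (dst sce-1 ... sce-N) lists into (source, destination) arcs."""
    arcs = []
    for dst, *sources in path_lists:
        for sce in sources:
            arcs.append((sce, dst))
    return arcs


# One trace, written as Python tuples instead of quoted LISP lists.
trace = [
    ("m2", "m1", "m3"),
    ("p1", "m1", "m2"),
    ("p2", "m3", "m2"),
    ("d1", "p1", "p2"),
]
for source, destination in fanin_arcs(trace):
    print(source, "->", destination)
```

Each destination contributes one arc per listed source, so the four lists above yield eight arcs.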

• The bit-widths statement

This optionally declares the bit-width of the arcs connecting various nodes. This statement
only applies to 2-point nets that connect the output from one block to the input of one
other block. The format is:

(bit-widths
    number-1 'node-source-1 'node-sink-1
    number-2 'node-source-2 'node-sink-2

)

where number-i is the bitwidth of the arc joining node-source-i to node-sink-i.

• The net-widths statement

This optionally declares the bit-width of arcs connecting various nodes. This statement
applies to multi-point nets that connect the outputs from one or more blocks to the inputs
of one or more blocks. The format is:

(net-widths
    number-1 '((ns-1.1 nd-1.1) (ns-1.2 nd-1.2) ... (ns-1.N1 nd-1.N1))
    number-2 '((ns-2.1 nd-2.1) ... (ns-2.N2 nd-2.N2))

)

where number-J is the bitwidth of net #J, which is defined as joining the outputs of
nodes ns-J.1 through ns-J.NJ to the inputs of nodes nd-J.1 through nd-J.NJ. This
statement simultaneously defines the multi-point nets and their bitwidth.

2.  Examples 

We  now  show  the  input  files  to  SP  and  the  output  from  SP  for  two  examples.  The  first  one 
is  for  a  small  graph  shown  in  Fig. 3;  the  second  one  describes  the  HP21-MX  CPU  and  is 
taken  from  [Park  85],  shown  in  Fig. 8. 

The  purpose  of  these  examples  is  to  give  the  reader  a  feel  for  the  number  and  type  of 
interactions  involved  in  using  OPD. 


Contd  Appendix  1  -  Example  from  Fig.3 


Script started on Wed Feb 24 22:22:07 1988
% cat nfig3.l

; This example is the one illustrated in Figure 3.
; It also appears in Appendix 1.

(nodes
    '(m1 100) '(m2 100) '(m3 100) '(p1 100) '(p2 100) '(d1 200)
)

(paths
    't1 1.0 '(m2 m1 m3) '(p1 m1 m2) '(p2 m3 m2) '(d1 p1 p2)
)

(make-sorder)

(bit-widths 1 'm2 'p2 1 'm2 'p1 1 'p2 'd1 1 'p1 'd1)
(net-widths 1 '((m3 m2) (m3 p2)) 1 '((m1 m2) (m1 p1)))

% lisp
Franz Lisp, Opus 43.1 [sun-20.5]
(C) Copyright 1985,1986,1987 Franz Inc., Alameda Ca.

=> (load 'nfig3)
;; Loading file "nfig3.l"
gen-node: WARN - node S delay = 0
gen-node: WARN - node E delay = 0
t

=> (setq partition (opt0 200))
-- NOTE: 200 is the target stage length.
(wall2 wall3 wall4 wall5)

=> (opt1 partition)
curcost = 202
202

=> (mapc 'print-wall partition)
-- NOTE: wall2 is a stage latch; the format is:
-- NOTE: arc-name source-node sink-node
WALL wall2:
    arc19.sorder    S->m3
    arc17.sorder    S->m1
WALL wall3:
    arc8.sorder     m1->p1
    arc12.sorder    m3->p2
    arc13.sorder    m3->m2
    arc9.sorder     m1->m2
WALL wall4:
    arc14.sorder    p1->d1
    arc15.sorder    p2->d1
WALL wall5:
    arc21.sorder    d1->E
(wall2 wall3 wall4 wall5)

=> (print-part partition)
Stage wall2->wall3
(m3 m1)
Stage wall3->wall4
(p1 p2 m2)
Stage wall4->wall5
(d1)
((wall2 wall3) (wall3 wall4) (wall4 wall5))

=> (exit)
% ^D
script done on Wed Feb 24 22:24:02 1988

Contd Appendix 1 - HP21-MX CPU Example (Fig.8)


Script started on Mon Nov 23 10:44:03 1987
%
%
% cat hp21mx.l
-- NOTE: This is the HP21-MX Example. See Fig.8.

; dataflow graph for the HP-21MX CPU
; See also Fig.8 please.
;
; From N. Park's Thesis, p. 82
; T1 corresponds to non-branch microcycles;
; T2 describes branches.

(nodes
    '(S 0)
    '(A 10)
    '(B 70)
    '(C 15)
    '(D 20)
    '(F 15)
    '(G 15)
    '(H 20)
    '(I 25)
    '(J 65)
    '(K 20)
    '(L 10)
    '(M 50)
    '(E 0)
)

(paths
    'T1                 ; non-branch microcycles
    0.1
    '(A S)
    '(B A)
    '(C B)
    '(D B C)
    '(F B C)
    '(G B C)
    '(H B C)
    '(I D)
    '(J F I)
    '(K G J)
    '(L H K)
    '(E L)
)

(paths
    'T2                 ; branch microcycles
    0.9
    '(A S)
    '(B A)
    '(C B)
    '(D B C)
    '(F B C)
    '(H B C)
    '(M F D)
    '(L M H)
    '(E L)
)

%
%
%
% lisp
Franz Lisp, Opus 43.1 [sun-20.5]
(C) Copyright 1985,1986,1987 Franz Inc., Alameda Ca.

=> (load 'hp21mx]
;; Loading file "hp21mx.l"
gen-node: WARN - node S delay = 0
gen-node: WARN - node E delay = 0
t

-- NOTE: The next statement sets up a seed partition. The "wall" (stage-latch)
-- NOTE: W is created such that its fanin nodes are B and C.

=> (walls '(W B C]
((W B C))

=> (trace cost]
-- NOTE: we trace the cost function to show you how often it is called during optimization.
(cost)

=> (opt1 (list W]
-- NOTE: this optimizes the seed partition above
1 <Enter> cost ((wall2))
1 <EXIT> cost 155
1 <Enter> cost ((wall2))
1 <EXIT> cost 225
1 <Enter> cost ((wall2))
1 <EXIT> cost 155
1 <Enter> cost ((wall2))
1 <EXIT> cost 140
curcost = 140
1 <Enter> cost ((wall2))
1 <EXIT> cost 155
1 <Enter> cost ((wall2))
1 <EXIT> cost 155
1 <Enter> cost ((wall2))
1 <EXIT> cost 155
1 <Enter> cost ((wall2))
1 <EXIT> cost 155
140

=> (opt1 (list W]
1 <Enter> cost ((wall2))
1 <EXIT> cost 140
1 <Enter> cost ((wall2))
1 <EXIT> cost 155
1 <Enter> cost ((wall2))
1 <EXIT> cost 120
curcost = 120
1 <Enter> cost ((wall2))
1 <EXIT> cost 140
1 <Enter> cost ((wall2))
1 <EXIT> cost 140
1 <Enter> cost ((wall2))
1 <EXIT> cost 140
120

-- NOTE: optimized max stage length = 120ns.


=> (print-wall W]
WALL wall2:
    arc37    B->F
    arc36    B->G
    arc35    B->H
    arc42    C->F
    arc41    C->G
    arc40    C->H
    arc52    D->M
    arc44    D->I
-- NOTE: Although this is not the same Partition as in [Park 85], it is
-- NOTE: just as good (same cycle time). Fig.8 shows the partition from [Park 85].
(arc37 arc36 arc35 arc42 arc41 arc40 arc52 arc44)

=> (setq part-IV (opt0 130]
-- NOTE: we are going to calculate
-- NOTE: a four-stage partition.
(wall87 wall88 wall89)

=> (opt1 part-IV]
curcost = 115
115
-- NOTE: a rough 4-stage partition (cycle = 115ns)

=> (opt1 part-IV]
curcost = 115
115
-- NOTE: Opt1 has now gotten stuck in a local minimum
-- NOTE: with a cycle time of 115ns for a 4-stage
-- NOTE: partition.
-- NOTE: so we are going to run Opt2 to escape from this local minimum.

=> (opt2 part-IV]
110
-- NOTE: found a better 4-stage partition (cycle = 110ns)

=> (opt1 part-IV]
8
95
-- NOTE: opt1 has found a still better 4-stage partition.

=> (opt1 part-IV]
95

=> (mapc 'print-wall part-IV]
WALL wall87:
    arc33    S->A
WALL wall88:
    arc41    C->G
    arc36    B->G
    arc40    C->H
    arc35    B->H
    arc42    C->F
    arc37    B->F
    arc43    C->D
    arc38    B->D
WALL wall89:
    arc53    F->M
    arc52    D->M
    arc46    G->K
    arc47    H->L
    arc48    I->J
    arc45    F->J
(wall87 wall88 wall89)

=  > 

Exiting... 

% 

script  done  on  Mon  Nov  23  11:06:35  1987 


Appendix  2  -  PA  input  format,  examples 


1.  Tool  Input  Description 

The PA input format is meant to be an easily parsed description syntax. Each statement is
a LISP expression and is implemented by a Franz LISP function; the arguments can therefore
be arbitrary LISP expressions.

The input format consists of five items: a clock specification, a description of the blocks and
latches, a description of the connectivity among blocks, a set of constraints, and a description
of which latches are actually used for pipeline staging (MOS designs sometimes use
latches to temporarily hold results; these latches do not define stage boundaries).

1.1.  Clock  Specification 

A  clock  cycle  is  assumed  to  consist  of  a  repeating  set  of  phases.  The  length  of  each  phase  is 
specified  via  the  phases  keyword: 

(phases  length-of-phase-1  ...  length-of-phase-N) 

where the cycle has N phases. The time units used have no intrinsic meaning to the
tool. However, the iterative optimization algorithms use a basic increment equal to one
time unit for exploring alternatives. With a finer unit, we can therefore obtain greater
precision at the expense of increased run times.

1.2.  Block  Description 

For each block, we specify three items: a name, the block's type, and its propagation delay:

(blocks '(block-name1 block-type1 block-delay1) ... )

Known types are STATIC, DYNAMIC, and LATCH. A STATIC block corresponds to combinational
logic or external blocks introducing delay. Such a block does not need to run on any
particular phase; it starts computing outputs as soon as the input data is stable. A
DYNAMIC block corresponds to dynamic, NORA, or domino logic gates. The phase on
which such gates run must open only after all the input data has settled. In order to
distinguish true dynamic gates from domino and NORA, we give the tool an extra constraint
to make sure that no chained dynamic gates will be assigned to the same clock phase. A
LATCH may or may not be used for staging. The difference between a LATCH and other
gates is that the phase on which LATCHes run can open before the input data has settled;
however, it must close only after the data is valid.

The delay associated with a block corresponds to a worst-case all-pairs input-to-output
propagation delay. Delays that depend on the particular input/output pair can be handled
by mapping each block to a set of parallel "dummy" blocks, with one "dummy" per
input/output pair.

The tool makes a few assumptions about the blocks. By default, gates are assumed to
be able to freeze their output; this usually means there has to be an enable line or some
logic on the clock. It is possible to set up constraints to model dynamic gates that lose
their output when precharged.

1.3.  Connectivity  Description 

This specifies the path that a datum follows through the gates and latches in order to get
the computation done. The path keyword is used:

(path '(dest1 sce1.1 sce1.2 ... sce1.N1) ... )

Here, dest1 and sce1.x are block names. The construct means that the outputs from blocks
sce1.1 through sce1.N1 feed block dest1. In other words, the path keyword is a convenient
way to specify the connectivity graph of the datapath. This graph must be a DAG. There
can be parallel paths, but there must be no loops. Parallel paths are scheduled for the
worst case. It is possible to deal with shared resources and fixed loops by unrolling the
loops and adding extra constraints.
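
The DAG requirement can be checked with a standard topological sort. The Python sketch below uses Kahn's algorithm; it is an illustration, not the check performed by PA, and the function name and the tuple encoding of the path lists are assumptions.

```python
from collections import deque


def topological_order(path_lists):
    """Return the blocks in topological order, or raise ValueError on a loop.

    Each entry is (dest, source-1, ..., source-N), mirroring the path keyword.
    """
    successors, indegree = {}, {}
    for dest, *sources in path_lists:
        indegree.setdefault(dest, 0)
        successors.setdefault(dest, [])
        for sce in sources:
            successors.setdefault(sce, []).append(dest)
            indegree.setdefault(sce, 0)
            indegree[dest] += 1
    ready = deque(b for b, n in indegree.items() if n == 0)
    order = []
    while ready:
        block = ready.popleft()
        order.append(block)
        for nxt in successors[block]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(indegree):
        raise ValueError("connectivity graph contains a loop")
    return order


print(topological_order([("B", "A"), ("C", "A", "B")]))  # ['A', 'B', 'C']
```

A block is emitted only after all of its sources, so any block left over when the queue empties must lie on a loop.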

1.4.  Constraints  Specification 


The general input syntax is:

(constraints '(type block1 ... blockN) ... )

The tool can handle five different constraints:

• EQPHI: Equal Phase constraint.

Syntax (EQPHI block1 ... blockN). Specifies that the given blocks
should be run on the same clock phase. The blocks need not necessarily run at the same
time (they can be off by an integer number of cycles). This constraint is typically used
when gates share a common clock line, or when we want to force all the dynamic gates to
be precharged on the same phase(s).

• FPHI: Fixed Phase constraint.

Syntax (FPHI block-name phase-number). Specifies that the given block should be run
on the given phase. This is useful for precharge schemes, as well as for handling external
constraints, such as inputs arriving on a known phase. Phases are numbered starting
with 0.

• NEQPHI: Non_equal phases constraint.

Syntax (NEQPHI block1 ... blockN). Specifies that the phases assigned to the given
blocks should be pairwise distinct.

• NVPHI: No_overlap "work-phases" constraint.

Syntax (NVPHI block1 ... blockN). Specifies that the given blocks should run off a set of
phases that are pairwise disjoint. This constraint is primarily used to describe resource
sharing within a stage. This is also useful to make sure that there are no open paths in
the stage latches, or to handle chained dynamic gates (which should not run off common
phases).

• NVT: No_overlap "work-times" constraint.

Syntax (NVT block1 ... blockN). Specifies that the work times of the given blocks should
be pairwise disjoint. This constraint is used to describe resources that are shared across
pipe stages. Each stage locks the resource for the duration of that stage. The work time
of a block is defined as the time during which the result from that block is needed to
ensure correct operation of the pipeline. The work time of a block therefore begins when
the stage latch that holds its input data opens. The work time ends when the output
from the block gets effectively latched at the end of the stage. This constraint describes
resource sharing between stages and can also describe unrolled loops. A shared resource
RESC is a block that is used twice, generally in different pipeline stages. As Fig.25
shows, we cannot model such a situation with only one copy of block RESC, since this
would be interpreted to mean that RESC must wait for both stages to complete, and this
would cause a loop in the connectivity graph. We need to duplicate RESC into two
blocks RESC1 and RESC2, one duplicate per utilization. Since RESC1 and RESC2 are
in fact implemented by the same physical gates, they cannot be used simultaneously.
This is what the constraint (NVT RESC1 RESC2) captures. We can then be sure that
the phase assignment and schedule produced will be implementable with only one
physical resource RESC.

Four of the five constraints are therefore concerned with phase assignment; the fifth, NVT,
is a scheduling constraint.
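
To make the constraint semantics concrete, the following Python sketch checks a candidate assignment against EQPHI, FPHI, NEQPHI and NVT. It is a simplified illustration, not PA's checker: it assumes one phase number per block and half-open work-time intervals, the function name and data layout are hypothetical, and NVPHI is left out because it needs the full set of phases each block works through.

```python
from itertools import combinations


def violated(constraints, phase, work_time):
    """Return the constraints that an assignment fails to satisfy.

    phase:     {block: phase number}   (assumed: one phase per block)
    work_time: {block: (start, end)}   (assumed: half-open intervals)
    """
    failures = []
    for kind, *args in constraints:
        if kind == "EQPHI":
            ok = len({phase[b] for b in args}) == 1
        elif kind == "FPHI":
            block, number = args
            ok = phase[block] == number
        elif kind == "NEQPHI":
            ok = len({phase[b] for b in args}) == len(args)
        elif kind == "NVT":
            ok = all(work_time[a][1] <= work_time[b][0]
                     or work_time[b][1] <= work_time[a][0]
                     for a, b in combinations(args, 2))
        else:
            raise ValueError("unsupported constraint: " + kind)
        if not ok:
            failures.append((kind, *args))
    return failures


constraints = [
    ("EQPHI", "G1", "G2"),
    ("FPHI", "G1", 1),
    ("NEQPHI", "RESC1", "RESC2"),
    ("NVT", "RESC1", "RESC2"),
]
phase = {"G1": 1, "G2": 1, "RESC1": 0, "RESC2": 2}
print(violated(constraints, phase, {"RESC1": (0, 35), "RESC2": (70, 105)}))  # []
print(violated(constraints, phase, {"RESC1": (0, 35), "RESC2": (20, 60)}))
```

In the second call the two copies of the shared resource have overlapping work times, so only the NVT constraint is reported.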

1.5.  Stage  Latches  Description 

We simply input a list of latch pairs that are to be used for staging. The syntax is:

(stages '(stage1-latch-start stage1-latch-end) ... )

We now show the input to PA and the output from PA for two examples. The first is a
small example which also illustrates PC, the phase calculation step; the second describes
the section of the SPUR [Patterson 87] [Hill 85] Floating Point Unit [Adams 86] datapath
which manipulates the fractional part of floating data. The SPUR FPU has about 40
blocks. We describe the FPU fraction datapath as it is used to perform the ADD instruction.
Since the FPU is a datapath chip with few opportunities for scheduling optimizations,
the resulting timing is very close to that used by the designers. There is a layer of logic
between the clocks and the latch controls in the FPU. It is therefore potentially possible to
change the phase assignment dynamically, to adjust to the particular instruction or
instruction mix being executed.

Contd Appendix 2 - Tutorial example from Fig.4

Script started on Sun Nov 22 19:50:41 1987
% cat nfig4.l

; Please refer to Fig.4
; This example illustrates both SP and PA/PC
; on the system shown in Fig.4

(nodes
    '(S 0)
    '(A1 1 r1)
    '(A2 1 r1)
    '(A3 1)
    '(B1 1)
    '(B2 2)
    '(B3 1)
    '(B4 1)
    '(C1 2)
    '(C2 2)
    '(C3 2)
    '(C4 2)
    '(E 0)
)

(resource-limits
    '(r1 1)
)

(paths
    'T1
    1.0
    '(A1 S)
    '(A2 S)
    '(A3 S)
    '(B1 A1)
    '(B2 A1 A2)
    '(B3 A2 A3)
    '(B4 A3)
    '(C1 B1)
    '(C2 B2)
    '(C3 B3)
    '(C4 B4)
    '(E C1 C2 C3 C4)
)

(setq part (opt0 2))

% lisp
Franz Lisp, Opus 43.1 [sun-20.5]
(C) Copyright 1985,1986,1987 Franz Inc., Alameda Ca.

=> (load 'nfig4)
;; Loading file "nfig4.l"
gen-node: WARN - node S delay = 0
gen-node: WARN - node E delay = 0
t


=> (setq part (opt0 2))
(wall7 wall8 wall9 wall10)

=> (opt1 part)
curcost = 2
2
-- NOTE: Best SP Partition cycle time = 2

=> (mapc 'print-wall part)
WALL wall7:
    arc18    S->A2
    arc17    S->A3
    arc21    A1->B1
    arc20    A1->B2
WALL wall8:
    arc23    A2->B2
    arc20    A1->B2
    arc26    B1->C1
    arc28    B3->C3
    arc29    B4->C4
WALL wall9:
    arc27    B2->C2
    arc30    C1->E
    arc32    C3->E
    arc33    C4->E
WALL wall10:
    arc33    C4->E
    arc32    C3->E
    arc30    C1->E
    arc31    C2->E
(wall7 wall8 wall9 wall10)

-- NOTE: We now run PC to break up this cycle = 2 units into phases.
-- NOTE: the following step produces the FOLDED graph as shown in Fig.4

=> (go-phase-calc part]
data path file name ?
-- NOTE: the datapath is provided by the designer.
nfig4.dp.l
(setq datapath-file 'nfig4.dp.l)
mini #phases ?
-- NOTE: We give one as minimal number of phases.
1
temp file for phase parms ?
nfig4.phases-temp.l
-- NOTE: this file holds the FOLDED graph (see Fig.4)

% cat nfig4.phases-temp.l

(setq datapath-file 'nfig4.dp.l)
(setq GLphase-calc 't)
(setq PCminiphis T)

(nodes
    '(S 0)
    '(A1 1 r1)
    '(A2 1 r1)
    '(A3 1)
    '(B1 1)
    '(B2 2)
    '(B3 1)
    '(B4 1)
    '(C1 2)
    '(C2 2)
    '(C3 2)
    '(C4 2)
    '(E 0)
)

(resource-limits
    '(r1 1)
)

(paths
    'P1
    1.0
    '(A1 S)
    '(A2 S)
    '(A3 S)
    '(B1 S)
    '(B2 S)
    '(B3 A3 A2)
    '(B4 A3)
    '(C1 S)
    '(C2 S)
    '(C3 S)
    '(C4 S)
    '(E C4 C3 C2 C1 B4 B3 B2 B1 A1)
)


-- NOTE: we are now going to run SP on the FOLDED graph
-- NOTE: to determine the phase lengths. See Fig.4.

% lisp
Franz Lisp, Opus 43.1 [sun-20.5]
(C) Copyright 1985,1986,1987 Franz Inc., Alameda Ca.

=> (load 'nfig4.phases-temp)
;; Loading file "nfig4.phases-temp.l"
gen-node: WARN - node S delay = 0
gen-node: WARN - node E delay = 0
t

=> (setq phase-part (opt0 2))
(wall2 wall3)

=> (opt1 phase-part)
-- NOTE: SP on the FOLDED graph calls PA to actually schedule the
-- NOTE: datapath with the phase length sequence under evaluation.
phase-calc-cost: (exec.refine nfig4.dp.l 2)

I> PAchk [clock. 1]: CPU time for preprocessing: 0.69 seconds
;; Loading file "result.refine.l"
phase-calc-cost 7
7
-- NOTE: this is the result from PA.
-- NOTE: The best split found is with just one phase.

c 

=  > 

Exiting... 

% 

script  done  on  Sun  Nov  22  20:02:20  1987 


APPENDIX 2

FIG A2.1: SPUR FPU "FRACTION DATAPATH"

Contd Appendix 2 - FPU Example


We now show the SPUR FPU example. What follows is the input file that describes the
FPU "Fraction" Datapath, used to produce the energy curve of Fig.21.

This version has an extra latch at the output of the 1s detector
to make the pipeline function correctly.


; (flags debug_poke debug-greedy)

(phases 35 35 35 35)

(blocks
    '(pads:0 LATCH 0)                       ; input pad
    '(pads:1 STATIC 0)                      ; output pad
    ; no NVT constraint here: they are different pins.
    '(RegDecoder DYNAMIC 9)
    '(DataI/O&UnpackConvert:0 STATIC 48)    ; copy for input data unpacking
    '(DataI/O&UnpackConvert:1 STATIC 48)    ; for packing output result
    '(RegisterFile:0 LATCH 15)              ; copy for writing out data
    '(RegisterFile:1 LATCH 15)              ; for reading back result
    '(BusDriver:0 STATIC 9)                 ; to load in data
    '(BusDriver:1 STATIC 9)                 ; write back result
    '(BusA:0 STATIC 0)
    '(BusB:0 STATIC 0)                      ; copy fed by input bus driver
    '(BusB:1 STATIC 0)                      ; copy fed by incrementer output latch
    '(BusB:2 STATIC 0)                      ; copy used for writing back result
    '(ALatch LATCH 5)
    '(BLatch LATCH 5)
    '(MuxFG STATIC 12)
    '(ExponentBox:0 STATIC 120)
    '(ExponentBox:1 STATIC 10)              ; fed by ones detector.
                                            ; Shorter delay since already set up.
    '(MuxA STATIC 12)
    '(OpALatch STATIC 5)
    '(R&LShifter:0 DYNAMIC 16)              ; Right Shift for aligning fractional part
    '(R&LShifter:1 DYNAMIC 16)              ; Left Shift for normalizing fractional part
    '(RShiftOutLatch LATCH 5)
    '(MuxB STATIC 12)
    '(OpBLatch LATCH 5)
    '(Adder STATIC 37)
    '(IntLatch LATCH 20)
    '(Complement STATIC 5)
    '(TestRlPLl STATIC 10)
    '(RlPLlShift STATIC 12)
    '(RoundPLA STATIC 10)
    '(63bitInc STATIC 37)
    '(IncOutLatch LATCH 20)
    '(1sDetector DYNAMIC 20)
    '(DetectorLatch LATCH 5)
    '(LShiftInLatch LATCH 5)
    '(LShiftOutLatch LATCH 20)
)


(path
    '(RegDecoder pads:0)
    '(DataI/O&UnpackConvert:0 pads:0)
    '(RegisterFile:0 RegDecoder DataI/O&UnpackConvert:0)
    '(BusDriver:0 RegisterFile:0)
    '(BusA:0 BusDriver:0)
    '(BusB:0 BusDriver:0)
    '(ALatch BusA:0)
    '(BLatch BusB:0)
    '(MuxFG ALatch BLatch ExponentBox:0)
    '(ExponentBox:0 pads:0)
    '(MuxA MuxFG BusA:0)
    '(OpALatch MuxA)
    '(R&LShifter:0 MuxFG ExponentBox:0)
    '(RShiftOutLatch R&LShifter:0)
    '(MuxB BusB:0 RShiftOutLatch)
    '(OpBLatch MuxB)
    '(Adder OpALatch OpBLatch)

    ; This concludes the section of the pipeline that feeds the adder.
    ; The following part gets the intermediate result and loads it back into
    ; the register file or the pads.

    '(IntLatch Adder)
    '(Complement IntLatch)
    '(TestRlPLl Complement)
    '(RlPLlShift TestRlPLl Complement)
    '(RoundPLA RlPLlShift)
    '(63bitInc RoundPLA RlPLlShift)
    '(IncOutLatch 63bitInc)
    '(BusB:1 IncOutLatch)
    '(1sDetector BusB:1)
    '(DetectorLatch 1sDetector)
    '(ExponentBox:1 DetectorLatch)
    '(LShiftInLatch BusB:1)
    '(R&LShifter:1 LShiftInLatch DetectorLatch ExponentBox:1)
    '(LShiftOutLatch R&LShifter:1)
    '(BusB:2 LShiftOutLatch)
    '(BusDriver:1 BusB:2)
    '(RegisterFile:1 BusDriver:1 RegDecoder)
    '(DataI/O&UnpackConvert:1 RegisterFile:1)
    '(pads:1 DataI/O&UnpackConvert:1)
)

(constraints 

;  make  sure  dynamic  gates  fire  only  after  input  has  stabilized. 
’(NVPHI RegDecoder pads:0)
’(NVPHI R&LShifter:0 ExponentBox:0)
’(NVPHI R&LShifter:0 MuxFG)
’(NVPHI 1sDetector BusB:1)
’(NVPHI R&LShifter:1 ExponentBox:1)
’(NVPHI R&LShifter:1 LShiftInLatch)
’(NVT BusB:0 BusB:1)
’(NVT BusB:0 BusB:2)
’(NVT BusB:1 BusB:2)
’(NVT R&LShifter:0 R&LShifter:1)
’(NVT BusDriver:0 BusDriver:1)
’(NVT RegisterFile:0 RegisterFile:1)
’(NVT DataI/O&UnpackConvert:0 DataI/O&UnpackConvert:1)

) 

(stages 

’(pads:0  ALatch) 

’(pads:0  BLatch) 

’(ALatch  OpALatch) 

’(BLatch OpBLatch)

’(OpALatch  IntLatch) 

’(OpBLatch  IntLatch) 

’(IntLatch  IncOutLatch) 

’(IncOutLatch  LShiftOutLatch) 

’(LShiftOutLatch  pads:l) 

) 


The raw text output from PA for the FPU "Fraction" Datapath follows.
The Fire Time is in the form: cycle#.phase#.offset time

Block                         Fire Time

pads:0                        -1.3.35
RegDecoder                    0.0.0
DataI/O&UnpackConvert:0       0.0.0
ExponentBox:0                 0.0.0
RegisterFile:0                0.1.35
BusDriver:0                   0.2.15
BusB:0                        0.2.24
BusA:0                        0.2.24
ALatch                        0.2.24
BLatch                        0.2.24
MuxFG                         0.3.15
MuxA                          0.3.27
R&LShifter:0                  1.0.0
OpALatch                      1.0.4
RShiftOutLatch                1.0.16
MuxB                          1.0.21
OpBLatch                      1.0.33
Adder                         1.1.3
IntLatch                      1.2.5
Complement                    1.2.25
TestRlPLl                     1.2.30
RlPLlShift                    1.3.5
RoundPLA                      1.3.17
63bitInc                      1.3.27
IncOutLatch                   2.0.29
BusB:1                        2.1.14
1sDetector                    2.2.0
LShiftInLatch                 2.1.14
DetectorLatch                 2.2.20
ExponentBox:1                 2.2.25
R&LShifter:1                  2.3.0
LShiftOutLatch                2.3.16
BusB:2                        3.0.1
BusDriver:1                   3.0.1
RegisterFile:1                3.0.10
DataI/O&UnpackConvert:1       3.0.25
pads:1                        3.2.3
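
The cycle#.phase#.offset notation used for the fire times can be turned into a single absolute time, which makes it easy to check a listing against the block delays. The following is a minimal Python sketch, not part of OPD. The function name is hypothetical, and the four phases of 35 time units each are an assumption inferred from the listing (pads:0 fires at -1.3.35, which is the same instant as 0.0.0).

```python
def fire_time_to_abs(fire, phase_lengths):
    """Convert a 'cycle.phase.offset' fire time to an absolute time.

    fire          -- string such as "0.2.15" or "-1.3.35"
    phase_lengths -- list with the length of each clock phase (assumed values)
    """
    cycle, phase, offset = (int(field) for field in fire.split("."))
    cycle_length = sum(phase_lengths)
    # Whole cycles, plus the phases already elapsed in this cycle, plus the offset.
    return cycle * cycle_length + sum(phase_lengths[:phase]) + offset


PHASES = [35, 35, 35, 35]  # assumption: four equal phases of 35 units

# BusDriver:0 fires at 0.2.15 and has a delay of 9, so BusB:0 can fire at 0.2.24.
bus_driver = fire_time_to_abs("0.2.15", PHASES)
bus_b = fire_time_to_abs("0.2.24", PHASES)
print(bus_b - bus_driver)
```

Under the same assumption, MuxA (0.3.27, delay 12) and OpALatch (1.0.4) are also exactly 12 units apart, which shows how an offset that overflows a phase carries into the next cycle.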
Appendix 3 - Stage Partitioning and Phase Length Calculation for the RISC-II

We  first  run  SP  on  a  dataflow  graph  based  on  the  RISC-II  CPU  [Katevenis  83].  Next,  we 
run  PC  with  this  partitioning  and  a  RISC-II  based  datapath.  Finally,  assuming  we  want 
four  phases  of  equal  length,  we  calculate  the  best  phase  length  to  sequence  the  RISC-II 
datapath  by  trying  out  a  range  of  values. 

This  example  illustrates  how  the  pipe  stage  lengths  are  related  to  the  phase  lengths. 
In  fact,  we  will  see  that  this  relationship  is  very  tenuous.  Even  though  we  may  be  able  to 
find  a  good  stage  partitioning,  and  to  do  a  good  phase  assignment  with  designer-chosen 
phase  lengths,  it  remains  difficult  to  calculate  these  phase  lengths  from  the  dataflow  graph 
and  the  datapath. 

The RISC-II example has about 30 blocks. In this case, using the designers' phase
length choice, PA assigned phases to datapath blocks in a way that is very close to what
the designers did; the only difference is in the reference, that is, which phase is counted
as the first one. This comes as no surprise, because the designers of the RISC-II chose the
phase lengths so as to satisfy the timing dependencies of the chip in an optimal fashion (see
[Katevenis 83] chap. 4.2 pp 102-112). By trying out different lengths for the phases, we
were able to confirm that the chosen length for each one of the four phases, 120
nanoseconds, is very close to the optimal length. The pipe throughput decreased as the
phase length moved away from this region. In fact, we found the optimum to be at 110ns,
but the small difference (less than 10%) can be accounted for by uncertainty in the delay
figures for the blocks.
Appendix 3 (contd.) - RISC-II Example


Script  started  on  Mon  Nov  23  11:29:43  198 
% 

% 

%  cat  r21-dfg.l 

; Example inspired by the RISC-II Micro-architecture
; (cf. M. Katevenis’ Thesis).

; This version has only one datapath for all instructions,
; and therefore covers the worst-case paths.

; This is the DFG corresponding to the actual datapath in r21.


(nodes
’(S 1) ’(E 1)

; First I-Fetch
’(nxtpc 1)              ; PC for instr access
’(dmux_0 1 DMUX)        ; latching multiplexor for memory address
’(pads_0 70)            ; delay to pins
’(mem_0 300 MEM)        ; Memory behaves like a dynamic - fixed work phases
’(ir 1)                 ; Instruction Register - abstraction for RD&IMM&OP

; Register File components
’(rr-dec 90)            ; Register Decode - number of reg accessed for read
’(rw-dec_0 90)          ; Register Decode - number of reg accessed for write
’(rw-dec_1 90)          ; Register Decode - number of reg accessed for write
’(rr 160)               ; Reg Read + precharge
’(rw_0 140 RW)          ; Reg Write - r/r instr
’(rw_1 140 RW)          ; Reg Write - load

; ALU and surrounding latches
’(alu 170 ALU)
’(alu-dum_0 170 ALU)
’(alu-dum_1 170 ALU)
’(ai 1)
’(bi 1)
’(dst_0 1 DST)          ; DST used to hold ALU output
’(dst_1 1 DST)          ; DST used to hold data from memory to write in RF
’(src 1)                ; to hold reg file output

; Dynamic Shifter
’(shift_0 40 SHIFT)     ; Shifter used for a shift instruction
’(shift_1 40 SHIFT)     ; Shifter used to route immediates
’(shift_2 40 SHIFT)     ; Shifter used to align data from memory

; LOAD Instruction
’(dmux_1 1 DMUX)
’(pads_1 70)            ; delay to pins
’(mem_1 300 MEM)
’(dimm 1)               ; to store data from memory
                        ; see shift_2, dst_1

; STORE Instruction

)

(resource-limits 

’(MEM  1) 

’(ALU  1) 

’(DST  1) 

’(DMUX  1) 

’(RW  1) 

’(RFP  1) 

’(SHIFT  1) 

) 

(paths
’T1
1.0

; First Ifetch

’(dmux_0 nxtpc) ’(pads_0 dmux_0)
’(mem_0 pads_0) ’(ir mem_0)

; Reg File to ALU

’(rr-dec ir)
’(rr rr-dec)            ; Read involved regs - address for LD/ST
’(src rr)               ; temp latch
’(ai rr)
’(bi rr shift_1)
’(alu ai bi)
’(shift_1 ir)           ; immediates routing

; LOAD instruction

’(dmux_1 dst_0)
’(pads_1 dmux_1)
’(mem_1 pads_1)
’(dimm mem_1)
’(shift_2 dimm)
’(dst_1 shift_2)

; STORE instruction

’(dst_0 alu shift_0)

; SHIFT instruction

’(shift_0 ir src)

; Dest Write into RegFile - r/r instr

’(rw-dec_0 ir)
’(rw_0 dst_0 alu-dum_0)
’(alu-dum_0 rw-dec_0)

; Dest write into regfile - load instr

’(rw-dec_1 ir)
’(rw_1 alu-dum_1 dst_1)
’(alu-dum_1 rw-dec_1)

)


% 

% 

-- NOTE: we are now going to run SP on the RISC-II Data Flow Graph.
%  lisp 

Franz  Lisp,  Opus  43.1  [sun-20.5] 

(C)  Copyright  1985,1986,1987  Franz  Inc.,  Alameda  Ca. 

=> (load ’r21-dfg]
;; Loading file "r21-dfg.l"
t

=> (setq part (opt0 450]
(wall2 wall3 wall4 wall5 wall6)

=> (opt1 part]
curcost = 671
curcost = 591
591

=> (opt1 part]
curcost = 501
curcost = 423
423

=> (opt1 part]
curcost = 422
422

=> (opt2 part]
curcost = 421
421

=> (opt2 part]
421

=> (opt1 part)
421

-- NOTE: best partition. Cycle time = 421ns.
=> (mapc ’print-wall part]

WALL wall2:
arc63    S->nxtpc

WALL wall3:
arc37    ir->shift_0
arc35    ir->rw-dec_1
arc36    ir->rw-dec_0
arc39    ir->rr-dec
arc38    ir->shift_1

WALL wall4:
arc35    ir->rw-dec_1
arc36    ir->rw-dec_0
arc55    shift_0->dst_0
arc46    alu->dst_0

WALL wall5:
arc65    rw_0->E
arc53    dst_1->rw_1
arc42    rw-dec_1->alu-dum_1

WALL wall6:
arc65    rw_0->E
arc67    rw_1->E

(wall2 wall3 wall4 wall5 wall6)

--  NOTE:  we  are  going  to  run  PC/PA  now 

=  >  (go-phase-calc  part] 
data  path  file  name  ? 
r21.1 

(setq  datapath-file  ’r21.1) 
mini  #phases  ? 

4 

temp  file  for  phase  parms  ? 

r21.phase-calc-temp.l 

% 


% cat r21.phase-calc-temp.l
(setq  datapath-file  ’r21.1) 
(setq  GLphase-calc ’t) 

(setq  PCminiphis  ’4) 
(nodes
’(S 1)
’(E 1)
’(nxtpc 1)
’(dmux_0 1 DMUX)
’(pads_0 70)
’(mem_0 300 MEM)
’(ir 1)
’(rr-dec 90)
’(rw-dec_0 90)
’(rw-dec_1 90)
’(rr 160)
’(rw_0 140 RW)
’(rw_1 140 RW)
’(alu 170 ALU)
’(alu-dum_0 170 ALU)
’(alu-dum_1 170 ALU)
’(ai 1)
’(bi 1)
’(dst_0 1 DST)
’(dst_1 1 DST)
’(src 1)
’(shift_0 40 SHIFT)
’(shift_1 40 SHIFT)
’(shift_2 40 SHIFT)
’(dmux_1 1 DMUX)
’(pads_1 70)
’(mem_1 300 MEM)
’(dimm 1)
)


(resource-limits 

’(DMUX  1) 
’(MEM  1) 
’(RW  1) 
’(ALU  1) 
’(DST  1) 
’(SHIFT  1) 
’(RFP  1) 

) 


(paths 

’Pg 

1.0 

’(E shift_0 dst_1 alu rw_1 rw_0 rw-dec_1 ir)
’(nxtpc S)
’(dmux_0 nxtpc)
’(pads_0 dmux_0)
’(mem_0 pads_0)
’(ir mem_0)
’(rr-dec S)
’(rw-dec_0 S)
’(rw-dec_1 S)
’(rr rr-dec)
’(rw_0 dst_0 alu-dum_0)
’(rw_1 alu-dum_1)
’(alu ai bi)
’(alu-dum_0 rw-dec_0)
’(alu-dum_1 S)
’(ai rr)
’(bi rr shift_1)
’(dst_0 S)
’(dst_1 shift_2)
’(src rr)
’(shift_0 src)
’(shift_1 S)
’(shift_2 dimm)
’(dmux_1 dst_0)
’(pads_1 dmux_1)
’(mem_1 pads_1)
’(dimm mem_1)

) 


% 


% 

%  lisp 

Franz  Lisp,  Opus  43.1  [sun-20.5] 

(C)  Copyright  1985,1986,1987  Franz  Inc.,  Alameda  Ca. 

=  >  (load  ’r21.phase-calc-temp] 

;;  Loading  file  "r21.phase-calc-temp.l" 
t 

=> (setq phase-part (opt0 120]
(wall2 wall3)

=> (opt1 phase-part]
-- NOTE: we have four phases of 90ns each, here.
phase-calc-cost: (exec.refine r21.1 90 90 90 90)
I> PAchk [clock.l]: CPU time for preprocessing: 3.26666536 seconds
;; Loading file "result.refine.l"
phase-calc-cost 510


=  > 

Exiting... 

%  lisp 

Franz  Lisp,  Opus  43.1  [sun-20.5] 

(C)  Copyright  1985,1986,1987  Franz  Inc.,  Alameda  Ca. 
=> (load ’r21.phase-calc-temp]
;; Loading file "r21.phase-calc-temp.l"
t

=> (setq part2 (opt0 200]
(wall2 wall3 wall4 wall5)

=> (opt12 part2)
-- NOTE: four phases of 170ns each.
-- NOTE: predictably, this is not a good solution.
phase-calc-cost: (exec.refine r21.1 170 170 170 170)
I> PAchk [clock.l]: CPU time for preprocessing: 3.48333194 seconds
;; Loading file "result.refine.l"
phase-calc-cost 710

-- NOTE: phases of different lengths - not acceptable for implementation.

phase-calc-cost:  (exec.refine  r21.1  340  161  170  223) 

I>  PAchk  [clock.l]:  CPU  time  for  preprocessing:  3.54999858  seconds 
;;  Loading  file  "result.refine.l" 
phase-calc-cost  523 
curcost  =  523 


Exiting... 


Appendix  4  -  Control  Synthesis  examples 

We show how pipeline control logic can be synthesized in two different styles using PA and
SCHED for a small example. Our objective here is to detail the steps that lead to the
definition of the control machine.

The example consists of MOS dynamic gates, which require precharging. The blocks
P1 and P2 represent the precharging of gates G1 and G2, respectively. The blocks L0, L1
and L2 are the stage latches. We start with the PA step; the cycle time is 100ns, with a
two-phase clock at 50ns per phase. Fig.a4.1 is a diagram of the example; the files are in
Fig.a4.4.

The following pages show the result from PA. The PA step is trivial on this small
example, since our purpose here is to describe control synthesis techniques. We also show
the output from SCHED. The resulting initiation sequence corresponds to a loop of minimal
length in the pipe's state graph (cf. [Kogge 81] and this report, chap. 4). We show this
state graph for the example.

This pipe can use either time-stationary or data-stationary control [Kogge 81]. In
data-stationary control, the control signals flow through the pipeline along with the data
they control. The control logic is therefore trivial: we simply ship control signals to enable
the gates and latches every time an initiation is made. Fig.a4.2 shows the control logic for
the initiation sequence produced by SCHED. The tradeoff is that data-stationary control
requires extra latches to correctly time the flow of control signals through the pipe.
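
To make the idea of control that flows with the data concrete, here is a small Python sketch, offered as an illustration rather than as the logic of Fig.a4.2. It models the delay latches as a shift register carrying a one-bit token per initiation. The function name and the `stage_slots` encoding are hypothetical; the slot values assume a two-stage pipe in which stage 1 is used in the cycle of the initiation and stage 2 in the two cycles after it.

```python
def data_stationary_enables(initiations, stage_slots, n_cycles):
    """Simulate enable signals produced by a control token shifted along with the data.

    initiations -- set of cycle numbers at which a new datum is loaded
    stage_slots -- per stage, the token ages (cycles since load) at which it is enabled
    n_cycles    -- number of clock cycles to simulate
    """
    depth = 1 + max(max(slots) for slots in stage_slots)
    token = [False] * depth  # token[k] is True if a datum was loaded k cycles ago
    trace = []
    for cycle in range(n_cycles):
        # Each delay latch hands its token to the next one; a new token enters on a load.
        token = [cycle in initiations] + token[:-1]
        trace.append([any(token[k] for k in slots) for slots in stage_slots])
    return trace


# Assumed usage: initiations every two cycles, stage 1 at age 0, stage 2 at ages 1 and 2.
trace = data_stationary_enables({0, 2, 4}, [[0], [1, 2]], 6)
for cycle, enables in enumerate(trace):
    print(cycle, enables)
```

After the first initiation has filled the pipe, the enables alternate between both stages active and only stage 2 active. The cost is visible in `token`: one extra latch per cycle of pipeline depth.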

The second control style is time-stationary control. In this scheme, all the gate control
and latch control signals are generated centrally. The controller therefore has to know
where there is valid data in the pipe at every instant. Using this information, the
controller generates the proper signals to time the flow of data through the pipe. Each exact
pattern of valid data in the pipeline corresponds to one state in the pipe's state diagram
[Kogge 81].
FIG A4.1: A TUTORIAL EXAMPLE

FIG A4.2: DATA-STATIONARY CONTROL
(Diagram labels: control inputs; stage #1 controls; delay latches L3 and L4; control "flows" with data.)

We can describe the controller as a two-level machine, shown in Fig.a4.3. The top-level
machine is the initiation engine. The next machine is the state decoder. The initiation
engine decides when new data should enter the pipe. Its output consists of two signals:
one signal indicates that a new datum should enter the pipe (LOAD), and the other is the
number of the new state the pipeline configuration will be entering (these are the states
from the pipe’s state diagram). The state decoder machine takes a pipeline state number as
input. This state number is sufficient to fully specify where valid data is located in the
pipe. The output of the state decoder is the full set of control signals required to enable
the various gates and latches of the pipeline.

In  the  case  of  static  [Kogge  81]  pipes,  the  initiation  engine  can  be  a  simple  FSM.  This 
FSM  can  be  generated  automatically  as  follows.  We  note  that  the  transition  diagram  of 
this  FSM  is  exactly  the  state  diagram  of  the  pipeline.  We  then  feed  the  pipe’s  state 
diagram  to  an  FSM  synthesis  tool  such  as  KISS  [DeMicheli  85]  or  MUSTANG  [Devadas  87] 
to  generate  the  initiation  engine. 
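
For readers who want to see what such a state diagram looks like as data, the sketch below builds one with the standard collision-vector construction from pipeline scheduling theory. It is an illustration under assumptions, not a description of how SCHED works internally: the function name is hypothetical, and the inputs (forbidden latencies 0 and 1, three time slots) are taken from the SCHED output for the small example of this appendix.

```python
def pipe_state_diagram(forbidden, n_columns):
    """Build {state: {latency: next_state}} from a set of forbidden latencies.

    A state is a collision vector stored as an integer: bit (k - 1) is set
    when an initiation k cycles from now would collide.
    """
    initial = 0
    for latency in forbidden:
        if latency >= 1:
            initial |= 1 << (latency - 1)
    diagram = {}
    pending = [initial]
    while pending:
        state = pending.pop()
        if state in diagram:
            continue
        diagram[state] = {}
        for latency in range(1, n_columns + 1):
            if (state >> (latency - 1)) & 1:
                continue  # this latency would cause a collision
            # Age the pending reservations, then add those of the new initiation.
            nxt = (state >> latency) | initial
            diagram[state][latency] = nxt
            pending.append(nxt)
    return diagram


diagram = pipe_state_diagram({0, 1}, 3)
print(diagram)
```

For these inputs the diagram has a single state with self-loops on latencies 2 and 3. A table of this shape is the kind of transition description one would hand to an FSM synthesis tool.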

For  dynamic  pipes  that  support  multiple  initiation  patterns,  this  engine  could  be  a 
fairly  complex  microcoded  machine  that  decides  when  to  accept  new  data  based  on  current 
resource  usage  and  scheduling  priorities.  For  instance,  the  Burroughs  BSP  Supercomputer 
[Hockney  81]  has  a  very  complex  control  engine  that  dynamically  schedules  instructions 
onto  its  vector  pipeline  to  maximize  throughput.  The  BSP  can  be  viewed  as  a  pipeline  that 
executes  instructions. 

The state decoder can be viewed as a combinational logic function. Each state
number corresponds to a pattern of valid data present in the pipe. The output from OPD
shows what this pattern looks like, as in Fig.a4.4. This pattern is sufficient to specify the
state decoder. We generate "enable" signals for a particular stage during a particular clock
cycle if there is a number in the entry that corresponds to that stage/cycle pair in the valid
data pattern calculated by OPD. Fig.a4.5 shows this in more detail; it also shows how we
can then generate an FSM or PLA to perform the state decoding.
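
The rule "enable a stage in a cycle if its entry holds a number" is simple enough to write down directly. The following Python sketch is only an illustration; the function name is hypothetical, and the pattern used is an assumption read from the first two cycles of the per-state reservation table printed by SCHED for the small example (rows `2-3` and `122`).

```python
def state_decoder_spec(valid_data_pattern):
    """Derive per-cycle stage enables from a valid-data pattern.

    valid_data_pattern -- one string per stage; each character is an initiation
                          number (stage holds valid data) or '-' (stage idle)
    Returns a list with one entry per cycle, each a list of booleans per stage.
    """
    n_cycles = len(valid_data_pattern[0])
    spec = []
    for cycle in range(n_cycles):
        # A stage is enabled exactly when its entry for this cycle is a number.
        spec.append([row[cycle].isdigit() for row in valid_data_pattern])
    return spec


# Assumed pattern: stage 1 is "2-", stage 2 is "12" over the two-cycle period.
spec = state_decoder_spec(["2-", "12"])
for cycle, enables in enumerate(spec, start=1):
    print("cycle", cycle, enables)
```

The resulting truth table (both stages enabled in the first cycle, only stage 2 in the second) is what would be handed to a logic synthesizer to produce the PLA.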
FIG A4.4

; Small Pipe example for control synthesis description
; The Pipe is:
;     L0 - P1 - G1 - L1 - P2 - G2 - L2
; where the Li are stage latches, the Pi are precharge blocks
; and the Gi are dynamic gates.

(phases 50 50)

(blocks
’(L0 LATCH 0)
’(P1 DYNAMIC 50)
’(G1 DYNAMIC 50)
’(L1 LATCH 0)
’(P2 DYNAMIC 50)
’(G2 DYNAMIC 100)
’(L2 LATCH 0)
)

(path
’(P1 L0) ’(G1 P1) ’(L1 G1) ’(P2 L1) ’(G2 P2) ’(L2 G2)
)

(constraints
’(EQPHI L0 L1 L2)
’(NEQPHI G1 P1)
’(NEQPHI G2 P2)
)

(stages
’(L0 L1)
’(L1 L2)
)


Output from PA:

Block    Fire (cycle.phase.offset)

L0       -1.1.50
P1       0.0.0
G1       0.1.0
L1       0.1.50
P2       1.0.0
G2       1.1.0
L2       2.1.0


Output  from  SCHED,  the  Reservation  Table  Scheduler: 
restab  unit:  70 

scheduler:  '/work/sched.hu/all  -t  3  -s  2 


The reservation table being optimized:
X--
-XX

The forbidden latency set contains time slot:

0 1

The range of MAL is : 2 <= MAL <= 2

The  optimal  sequence  is  as  follows  : 

OPTSEQ  1(2) 

The  order  of  states  are: 

110 

OPTMAL  2 

Reservation  Table  at  each  State: 

State0

2-3 

122 

From  this,  we  extract  the  steady-state  reservation  table: 
(please  refer  to  the  text) 

w-i 

ixixi 

This  corresponds  to  the  first  two  cycles  of  the  table  above. 
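
The forbidden latency set and the latency bound reported above can be reproduced by hand from the reservation table. The Python sketch below illustrates that calculation; it is not SCHED's algorithm. The function names are hypothetical, the table rows `X--` and `-XX` assume three time slots per row, and only constant-latency initiation cycles are considered (in general the minimum average latency may require a non-constant cycle).

```python
def forbidden_latencies(table):
    """Return the set of latencies that make two initiations collide.

    table -- one string per stage, 'X' where the stage is used in that time slot
    Latency 0 is always forbidden: two data cannot enter at the same time.
    """
    forbidden = {0}
    for row in table:
        used = [slot for slot, mark in enumerate(row) if mark == "X"]
        forbidden.update(b - a for a in used for b in used if b > a)
    return forbidden


def smallest_constant_latency(table):
    """Smallest L such that initiating every L cycles never collides."""
    forbidden = forbidden_latencies(table) - {0}
    latency = 1
    # A constant latency L collides if some multiple of L is a forbidden latency.
    while any(f % latency == 0 for f in forbidden):
        latency += 1
    return latency


table = ["X--", "-XX"]
print(sorted(forbidden_latencies(table)))
print(smallest_constant_latency(table))
```

For this table the sketch gives the forbidden set {0, 1} and a constant latency of 2, which agrees with the values printed by SCHED.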
FIG A4.5: TIME-STATIONARY CONTROL

(Diagram: valid-data pattern by cycle number for STAGE #1 and STAGE #2; X, 0, *: successive initiations.)

STATE DECODER SPECIFICATION

CYCLE 1 => ENABLE STAGE#1 AND STAGE#2

CYCLE 2 => DISABLE STAGE#1, ENABLE STAGE#2

THERE IS ONLY ONE PIPE STATE HERE

THE PIPE STATE REPEATS ITSELF EVERY TWO CLOCKS

THE SPEC IS FED TO A LOGIC SYNTHESIZER


